Atomic vs pre-Aggregated data

Roger_Friedman · August 22, 2014, 10:39pm

Derek –

First, I think that obfuscation in the SHR is a really bad idea, as it makes it much harder to link records across facilities, which is a major purpose of the SHR. Nor do you consider the privacy interests of the physician, who just might not want people to know how many of his surgical patients died on the table. Maybe you could do it by ending up with a lot of PI in the patient registry and an encounter finder that includes patient, provider and facility, but then you are not really de-identifying the data, you are just moving around the deck chairs. The enemy of privacy remains the person with a necklace of SecurID tokens.

Second, you can’t do obfuscation on a record-by-record basis, because that doesn’t allow you to implement small-cell-value rules. For example, if one patient died of Ebola, you can’t report the case without making the patient identifiable. Also, you can’t really do geographic obfuscation when the encounter is with a village health promoter. However, as you consolidate data upward through the hierarchy, your small cell count problems disappear and your geographic area for obfuscation increases. Note that this problem applies to aggregated data also, so you must treat aggregated data as containing PI and only display values after censoring. That doesn’t mean you can’t use the data in a PI-protective environment.
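
A minimal sketch of the small-cell rule described above, to make the effect of consolidating upward concrete. The threshold of 5, the field names and the sample records are all assumptions for illustration; real suppression rules are set by policy.

```python
from collections import Counter

def suppressed_counts(records, keys, threshold=5):
    """Count records per cell; censor (None) any cell below the threshold.

    The threshold of 5 is an illustrative assumption, not a standard.
    """
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    return {cell: (n if n >= threshold else None) for cell, n in counts.items()}

# Hypothetical death records, already stripped of direct identifiers.
records = (
    [{"district": "A", "village": "A1", "cause": "malaria"}] * 6
    + [{"district": "A", "village": "A2", "cause": "malaria"}] * 3
    + [{"district": "A", "village": "A1", "cause": "ebola"}]
)

by_village = suppressed_counts(records, ["village", "cause"])
by_district = suppressed_counts(records, ["district", "cause"])
```

At village level the three malaria deaths in A2 are censored; at district level they merge into a reportable cell of nine. The single Ebola case stays censored at both levels, which is the point about rare events.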

Third, the idea of obfuscated data is similar to the microsample concept used in Zambia – the records were identified only to the extent permitted
by obfuscation (and I believe the records contained population weights, but I’m not sure about that). This concept was not well accepted by the MOH.

Fourth, the idea of receiving aggregated data from a provider is well accepted. I just watched a presentation yesterday afternoon about the NYC system, which is capable of downloading queries producing aggregate data to providers’ EHRs (as long as they use eClinicalWorks) and receiving answers back, typically overnight. At the moment, we are still struggling to provide monthly data from EHRs to DHIS, and the idea of calculating this data from the SHR has not really been tried. The patient tracker interface in DHIS2 was not really appropriate for collecting or transmitting PI the last time I checked.

Fifth, I don’t think the idea of academic use of this data bears much weight. To be useful for research, or even for epidemiology or M&E, there
needs to be collection of non-clinical data and there needs to be careful curation of the data beyond the needs of the clinician. The human and financial cost of data gathering is incurred by the facility, and the gathering of data not useful to the clinic
guarantees data quality problems.

Sixth, regarding Carl’s point, this arises from a limitation of DHIS2, which requires all data collection to be periodic. There is nothing
that stops the data from being stored in the data warehouse except the requirement of periodicity and nothing that stops the data from being aggregated or interpolated but for the lack of aggregation or interpolation rules for aperiodic data. The big advantage
of DHIS’ requirement of a time period is that it permits automatic identification of failures to report.
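
To illustrate why a fixed reporting period makes missed reports detectable, here is a minimal sketch. It is not DHIS2 code; the function name, facility IDs and period labels are hypothetical.

```python
def missing_reports(expected_periods, facilities, received):
    """Return the (facility, period) pairs for which no report was received."""
    return sorted(
        (facility, period)
        for facility in facilities
        for period in expected_periods
        if (facility, period) not in received
    )

periods = ["2014-05", "2014-06", "2014-07"]
facilities = ["clinic-a", "clinic-b"]
received = {
    ("clinic-a", "2014-05"), ("clinic-a", "2014-06"), ("clinic-a", "2014-07"),
    ("clinic-b", "2014-05"), ("clinic-b", "2014-07"),
}

gaps = missing_reports(periods, facilities, received)
```

With aperiodic data there is no list of expected periods to compare against, so the absence of a record cannot be told apart from the absence of an event.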

Hope you are doing well and that you soon start a new streak of good ideas.

Saludos, Roger

···

I’m starting this as a new thread, even though many of the ideas have been introduced/explored in the immediately previous “HMIS messages” thread. The line of thinking I’m proposing here follows on from an email conversation I had with Bob and Jim earlier today and a phone conversation with Paul yesterday. These, and other ad hoc conversations I’ve had lately, have focused on the important role DHIS2 can play as an analytics engine for OpenHIE (and, frankly, in any implementation that “sets up” to use it that way, whether they’re leveraging OpenHIE or not…).

Basically, I would like to advocate for the following:

1. We set up a de-identification (obfuscation) routine that masks the person-centric demographic information. I would suggest that this would be algorithmic and, under governed circumstances, the algorithm could be run “in reverse” to re-establish the demographic link. Such a technique ensures patient privacy while still supporting the use of the analytic engine in surveillance cases where, for example, an outbreak is detected and it is important to start to narrow down to a “patient zero”.
2. eHealth transactions written to the OpenHIE SHR are obfuscated and also written (twinned) to DHIS2 via the Patient Tracker interface. I can see no reason why this couldn’t be done on a transactional basis (as opposed to a nightly ETL, for example).
3. In whatever timeframe is appropriate, at whatever reporting level is appropriate, indicators and metrics may be generated using DHIS2’s reporting and graphical presentation capabilities.
4. Longitudinal analyses may be supported (since we will have obfuscated, but uniquely ID’d, patient-level data); this opens up powerful opportunities for patient-centred outcomes research.
5. Regarding research, the opportunities for the MOH to grant access to the de-identified database to support academic explorations would be significantly strengthened as the patient privacy issues – while not zero – are largely addressed.

I don’t number these points so much to imply they are sequential as to give me a way to reference them. For example, point 1 suggests we create a de-identified, patient privacy-safe dataset out of our identified, person-centric transactional data streams. My suggestion to allow the obfuscation algorithm to be 2-way is simply informed by Canada’s experience during the SARS crisis. After SARS, the value of governed, 2-way algorithms was recognized by the MOH because of how important it was to be able to re-identify individuals so that public health interventions could react to signals from the surveillance system. We could, of course, use irreversible techniques… that’s just not what is now favoured in Canada for the reasons I’ve cited. (ref: https://www.infoway-inforoute.ca/index.php/component/docman/doc_download/624-tools-for-de-identification-of-personal-health-information)
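
One way to picture a governed, 2-way scheme is sketched below. It is an assumption-laden illustration, not the method from the referenced document: it swaps in a keyed hash (HMAC-SHA256) to derive a stable pseudonym, plus a separately held escrow table for re-identification, rather than a cipher that is literally run in reverse. The class name, the `authorised` flag and the sample ID are hypothetical.

```python
import hashlib
import hmac

class Pseudonymiser:
    """Keyed pseudonyms with an escrow table for governed re-identification."""

    def __init__(self, secret_key: bytes):
        self._key = secret_key
        self._escrow = {}  # pseudonym -> real ID; would live in a separate, access-controlled store

    def pseudonymise(self, patient_id: str) -> str:
        digest = hmac.new(self._key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()
        pseudonym = digest[:16]  # truncated for readability (assumption)
        self._escrow[pseudonym] = patient_id
        return pseudonym

    def reidentify(self, pseudonym: str, authorised: bool) -> str:
        # "authorised" stands in for whatever governance process approves the lookup.
        if not authorised:
            raise PermissionError("re-identification requires governed approval")
        return self._escrow[pseudonym]

p = Pseudonymiser(b"demo-key-not-for-production")
token = p.pseudonymise("NID-0001")
```

Because the same patient ID always yields the same pseudonym, patient-level records can still be linked over time in the analytics store without carrying the real identifier.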

Point 2 starts to get to the heart of Carl’s questions regarding point-in-time vs. aggregates per time period. Quite frankly, I think we will be far better served to pass ATOMIC, timestamped data to DHIS2 and let it do its own aggregations. I know that there is an interface (DXF2) for pushing summary indicators to DHIS2. My sense, however, is that there is a MUCH stronger and more attractive opportunity to leverage DHIS2 as the analysis engine (not just indicators repository) for OpenHIE and that this opportunity will rely on bringing in the more atomic data. I believe the Patient Tracker interface is the right way to do this. Today, for example, DHIS2 can generate its own metrics/indicators from the atomic-level data it gets from Tracker. We should, in my view, make use of this existing capability (this is point #3). I would even advocate for using this atomic data concept for the HWR data feeds discussed on the previous thread.
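
The argument for atomic data can be sketched in a few lines: once each event carries its own timestamp, the reporting period becomes a choice made at analysis time. This is plain Python, not the DHIS2 Tracker or DXF2 API; the event fields and helper names are hypothetical.

```python
from collections import Counter
from datetime import datetime

def month_of(ts):
    return ts.strftime("%Y-%m")

def iso_week_of(ts):
    year, week, _ = ts.isocalendar()
    return f"{year}-W{week:02d}"

def aggregate(events, period_of):
    """Count events per (facility, type, period) using the given period function."""
    counts = Counter()
    for e in events:
        ts = datetime.fromisoformat(e["timestamp"])
        counts[(e["facility"], e["type"], period_of(ts))] += 1
    return dict(counts)

# Hypothetical atomic, timestamped events.
events = [
    {"facility": "clinic-a", "type": "anc_visit", "timestamp": "2014-08-01T09:30:00"},
    {"facility": "clinic-a", "type": "anc_visit", "timestamp": "2014-08-20T14:05:00"},
    {"facility": "clinic-a", "type": "anc_visit", "timestamp": "2014-09-02T08:15:00"},
    {"facility": "clinic-b", "type": "hiv_test", "timestamp": "2014-08-11T11:00:00"},
]

monthly = aggregate(events, month_of)
weekly = aggregate(events, iso_week_of)
```

The same four events feed both roll-ups. Pre-aggregated monthly totals could produce the first but never the second.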

Regarding point #4 – this is, in my view, perhaps the single greatest M&E opportunity. One of the things I’m quite actively doing in my professional practice is helping MOHs get their arms around how data will be employed to support decision-making. For many care programmes (maternal, HIV, TB, NCDs, etc.), what we need to be able to analyse, to support management and decision-making, is the patient trajectory and how our interventions impact the ongoing care of that individual. Statistically aggregating these multiple trajectories helps us answer questions like “how many women attended all 4 ANC visits – and were their labour/delivery outcomes any better compared to those who did not attend all 4?” and “what was the impact of SMS medication reminders on loss to follow-up?” and such. These, and other explorations, are at the heart of point #5. I believe patient-centred outcomes research holds huge promise and we should expect OpenHIE’s analytics engine to be able to support it.
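
As a rough illustration of the ANC question, the sketch below groups pseudonymised events by patient and compares delivery outcomes between women who attended all four visits and those who did not. The event schema, the `good_outcome` flag and the sample data are assumptions, and the comparison is unadjusted, so it shows the shape of the query rather than a valid study design.

```python
from collections import defaultdict

def anc_outcome_rates(events, required_visits=4):
    """Share of good delivery outcomes, split by ANC attendance."""
    visits = defaultdict(set)   # pseudonym -> distinct ANC visit numbers
    outcomes = {}               # pseudonym -> True if delivery outcome was good
    for e in events:
        if e["type"] == "anc_visit":
            visits[e["pid"]].add(e["visit_no"])
        elif e["type"] == "delivery":
            outcomes[e["pid"]] = e["good_outcome"]

    groups = {"all_visits": [], "fewer_visits": []}
    for pid, good in outcomes.items():
        key = "all_visits" if len(visits[pid]) >= required_visits else "fewer_visits"
        groups[key].append(good)
    return {k: (sum(v) / len(v) if v else None) for k, v in groups.items()}

def anc(pid, n):
    return [{"pid": pid, "type": "anc_visit", "visit_no": i} for i in range(1, n + 1)]

# Hypothetical patient-level events keyed by pseudonym.
events = (
    anc("p1", 4) + [{"pid": "p1", "type": "delivery", "good_outcome": True}]
    + anc("p2", 4) + [{"pid": "p2", "type": "delivery", "good_outcome": True}]
    + anc("p3", 2) + [{"pid": "p3", "type": "delivery", "good_outcome": False}]
    + anc("p4", 1) + [{"pid": "p4", "type": "delivery", "good_outcome": True}]
)

rates = anc_outcome_rates(events)
```

This kind of per-patient grouping is only possible when the records share a stable patient key. Periodic aggregate counts of ANC visits and deliveries cannot be joined this way.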

As Bob has correctly pointed out, DHIS2 is used in all sorts of contexts. OpenHIE, however, does not need to pretend it is a “lowest common denominator”; it is a data rich and architecturally
precise context. Within the context of OpenHIE, why would we even consider pre-aggregating content on a periodic basis when we could, instead, give DHIS2 the phenomenally more valuable atomic data to chew on?

Just a thought…

Derek.

PS: There was quite a bit of talk in the previous thread about facilities and the need to manage them in hierarchies. I must admit – I think this issue is being over-thought. The key, in my view, is to unambiguously identify the facility. Period. Whatever hierarchies this facility is a member of are a completely separate and pretty much arbitrary construct which can be managed in tandem. A facility may operate within a management hierarchy and within a geographic hierarchy and be a member of multiple arbitrary subgroups (the group of maternal care facilities; the group of FBO-operated facilities; the group of facilities connected to the HIE; the group of facilities participating in a particular RCT-based pilot). These attributes of a facility, some of which are purely to support arbitrary reporting roll-ups, can be fluid over time and should be managed as such. We’re far better off, in my opinion, to approach this as a relational problem rather than as a hierarchical one. My $0.02…
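
A small sketch of the relational approach, using an in-memory SQLite database. The table names, column names, group IDs and dates are all hypothetical; the point is only that a facility has one stable identifier and any number of dated group memberships.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE facility (id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE facility_group (id TEXT PRIMARY KEY, kind TEXT);
CREATE TABLE membership (
    facility_id TEXT REFERENCES facility(id),
    group_id    TEXT REFERENCES facility_group(id),
    valid_from  TEXT,   -- ISO dates, so string comparison orders correctly
    valid_to    TEXT    -- NULL means the membership is still current
);
""")
conn.execute("INSERT INTO facility VALUES ('F001', 'Example Health Centre')")
conn.executemany("INSERT INTO facility_group VALUES (?, ?)", [
    ("district-north", "geographic"),
    ("maternal-care", "programme"),
    ("rct-pilot", "study"),
])
conn.executemany("INSERT INTO membership VALUES (?, ?, ?, ?)", [
    ("F001", "district-north", "2010-01-01", None),
    ("F001", "maternal-care", "2012-01-01", None),
    ("F001", "rct-pilot", "2014-01-01", "2014-07-01"),
])

def groups_on(conn, facility_id, day):
    """Groups the facility belonged to on the given ISO date."""
    rows = conn.execute(
        "SELECT group_id FROM membership "
        "WHERE facility_id = ? AND valid_from <= ? "
        "AND (valid_to IS NULL OR valid_to > ?) ORDER BY group_id",
        (facility_id, day, day),
    )
    return [r[0] for r in rows]

during_pilot = groups_on(conn, "F001", "2014-03-15")
after_pilot = groups_on(conn, "F001", "2014-08-22")
```

A geographic or management hierarchy is then just one more kind of group, and a roll-up is a join on whichever memberships were valid for the period being reported.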

–
You received this message because you are subscribed to the Google Groups “Open HMIS” group.
To unsubscribe from this group and stop receiving emails from it, send an email to
open-hmis+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

derek.ritz · August 22, 2014, 11:15pm

Hi Roger.

Thanks for the detailed response. I apologize – I think I should have been more precise – the obfuscation that I’m recommending is pseudonymisation. I used the term “obfuscation” because I thought it would be better understood but I fear what I’ve inadvertently done is made it sound like I was proposing something “new”. I’m not. As was described in the document linked in my original post – this is a very mature technique and the recommended procedures for employing it address the issues you have raised. This linked document, which I feel is quite good, speaks to your points 1, 2 and 3.

RE: the patient tracker interface – it is, today, used to track person-centric information directly within DHIS2. My understanding (Bob, Jim… a little help here, please?) is that DHIS2 is very able to generate the reports that it needs to report without requiring the data that was originally submitted via the Patient Tracker interface to be independently aggregated and re-submitted via the DXF interface. This speaks, I think, to your points 4 and 6.

Regarding the academic research – point #5 – this may actually be carried out by groups inside the MOH (we have, in my Canadian province, an embedded research institute – ICES http://www.ices.on.ca/ – which carries out analytics using the province’s data holdings). Like I said, the barriers are lower for sharing a de-identified data set outside the MOH… but they are not zero. I personally believe such research to be useful and potentially quite important. Others feel that way, too. The US, for example, has applied about 10% of its ~$35 billion national eHealth budget to support patient-centred outcomes research (http://www.pcori.org/) and has very high expectations regarding the contribution this can make to improving care delivery.

I’m well and hope you are too, my friend. Thank you, again, for keeping the conversation active and engaging.

Warmest regards,

Derek.
