Atomic vs pre-Aggregated data

derek.ritz · August 21, 2014, 8:40pm

I’m starting this as a new thread, even though many of the ideas have been introduced/explored in the immediately previous “HMIS messages” thread. The line of thinking I’m proposing here follows on from an email conversation I had with Bob and Jim earlier today and a phone conversation with Paul yesterday. These, and other ad hoc conversations I’ve had lately, have focused on the important role DHIS2 can play as an analytics engine for OpenHIE (and, frankly, in any implementation that “sets up” to use it that way whether they’re leveraging OpenHIE or not…).

Basically, I would like to advocate for the following:

1. We set up a de-identification (obfuscation) routine that masks the person-centric demographic information. I would suggest that this would be algorithmic and, under governed circumstances, the algorithm could be run “in reverse” to re-establish the demographic link. Such a technique ensures patient privacy while still supporting the use of the analytic engine in surveillance cases where, for example, an outbreak is detected and it is important to start to narrow down to a “patient zero”.

2. eHealth transactions written to the OpenHIE SHR are obfuscated and also written (twinned) to DHIS2 via the Patient Tracker interface. I can see no reason why this couldn’t be done on a transactional basis (as opposed to a nightly ETL, for example).

3. In whatever timeframe is appropriate, at whatever reporting level is appropriate, indicators and metrics may be generated using DHIS2’s reporting and graphical presentation capabilities.

4. Longitudinal analyses may be supported (since we will have obfuscated, but uniquely ID’d, patient-level data); this opens up powerful opportunities for patient-centred outcomes research.

5. Regarding research, the opportunities for the MOH to grant access to the de-identified database to support academic explorations would be significantly strengthened as the patient privacy issues – while not zero – are largely addressed.

I don’t number these points so much to imply they are sequential as to give me a way to reference them. For example, point 1 suggests we create a de-identified, patient privacy-safe dataset out of our identified, person-centric transactional data streams. My suggestion to allow the obfuscation algorithm to be 2-way is simply informed by Canada’s experience during the SARS crisis. After SARS, the value of governed, 2-way algorithms was recognized by the MOH because of how important it was to be able to re-identify individuals so that public health interventions could react to signals from the surveillance system. We could, of course, use irreversible techniques… that’s just not what is now favoured in Canada for the reasons I’ve cited. (ref: https://www.infoway-inforoute.ca/index.php/component/docman/doc_download/624-tools-for-de-identification-of-personal-health-information)
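To make the “2-way” idea in point 1 a bit more concrete, here is a minimal sketch of a keyed, reversible obfuscation of a patient identifier. It is an illustration only, not a vetted de-identification scheme and not something OpenHIE or Infoway prescribes. The function names, the assumption that patient IDs fit in 64-bit integers, and the small Feistel construction built on HMAC-SHA256 are all choices made for this example. It also covers only the identifier; masking the rest of the demographics is a separate step.

```python
import hashlib
import hmac

ROUNDS = 4                      # illustrative; a real design needs cryptographic review
HALF_BITS = 32                  # assumption: IDs are 64-bit integers, split in two halves
MASK = (1 << HALF_BITS) - 1


def _round_value(key: bytes, round_no: int, half: int) -> int:
    """Keyed round function: first 32 bits of HMAC-SHA256(key, round || half)."""
    msg = bytes([round_no]) + half.to_bytes(4, "big")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big")


def _check_range(value: int) -> None:
    if not 0 <= value < (1 << (2 * HALF_BITS)):
        raise ValueError("identifier must fit in 64 bits")


def obfuscate(patient_id: int, key: bytes) -> int:
    """Map a patient ID to a pseudonym; the same ID and key always give the same result."""
    _check_range(patient_id)
    left, right = (patient_id >> HALF_BITS) & MASK, patient_id & MASK
    for i in range(ROUNDS):
        left, right = right, left ^ _round_value(key, i, right)
    return (left << HALF_BITS) | right


def reidentify(pseudonym: int, key: bytes) -> int:
    """Run the same rounds in reverse order to recover the original ID."""
    _check_range(pseudonym)
    left, right = (pseudonym >> HALF_BITS) & MASK, pseudonym & MASK
    for i in reversed(range(ROUNDS)):
        left, right = right ^ _round_value(key, i, left), left
    return (left << HALF_BITS) | right
```

Because the mapping is deterministic, the same patient always gets the same pseudonym, which is what makes longitudinal analysis possible. Only a holder of the key can call `reidentify`, so in this sketch the key is the thing that would sit under governance.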

Point 2 starts to get to the heart of Carl’s questions regarding point-in-time vs. aggregates per time period. Quite frankly, I think we will be far better served to pass ATOMIC, timestamped data to DHIS2 and let it do its own aggregations. I know that there is an interface (DXF2) for pushing summary indicators to DHIS2. My sense, however, is that there is a MUCH stronger and more attractive opportunity to leverage DHIS2 as the analysis engine (not just indicators repository) for OpenHIE and that this opportunity will rely on bringing in the more atomic data. I believe the Patient Tracker interface is the right way to do this. Today, for example, DHIS2 can generate its own metrics/indicators from the atomic-level data it gets from Tracker. We should, in my view, make use of this existing capability (this is point #3). I would even advocate for using this atomic data concept for the HWR data feeds discussed on the previous thread.
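The asymmetry behind this argument can be sketched in a few lines: from timestamped atomic events you can derive monthly counts, weekly counts, or any other roll-up after the fact, whereas a pre-aggregated monthly total cannot be split back into weeks. The sketch below is plain Python, not the DHIS2 or Tracker API. The `Event` record, its field names, and the sample events are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    """One atomic, de-identified transaction (hypothetical shape)."""
    pseudonym: str
    facility_id: str
    event_type: str
    timestamp: datetime


def monthly_counts(events):
    """Count events per (facility, event type, YYYY-MM)."""
    return Counter(
        (e.facility_id, e.event_type, e.timestamp.strftime("%Y-%m"))
        for e in events
    )


def weekly_counts(events):
    """Count the same events per (facility, event type, ISO year-week)."""
    counts = Counter()
    for e in events:
        iso = e.timestamp.isocalendar()
        counts[(e.facility_id, e.event_type, f"{iso[0]}-W{iso[1]:02d}")] += 1
    return counts


# Made-up sample events, for illustration only.
EVENTS = [
    Event("p-001", "FAC-A", "ANC_VISIT", datetime(2014, 8, 4, 9, 30)),
    Event("p-002", "FAC-A", "ANC_VISIT", datetime(2014, 8, 19, 14, 0)),
    Event("p-001", "FAC-A", "ANC_VISIT", datetime(2014, 9, 2, 10, 15)),
    Event("p-003", "FAC-B", "ANC_VISIT", datetime(2014, 8, 21, 11, 45)),
]
```

Calling `monthly_counts(EVENTS)` and `weekly_counts(EVENTS)` on the same list gives two different reporting views of the same four events. Had only the monthly totals been sent, the weekly view could not be reconstructed.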

Regarding point #4 – this is, in my view, perhaps the single greatest M&E opportunity. One of the things I’m quite actively doing in my professional practice is helping MOHs get their arms around how data will be employed to support decision-making. For many care programmes (maternal, HIV, TB, NCDs, etc.), what we need to be able to analyse, to support management and decision-making, is the patient trajectory and how our interventions impact the ongoing care of that individual. The statistical aggregating of these multiple trajectories helps us answer questions like “how many women attended all 4 ANC visits – and were their labour/delivery outcomes any better compared to those who did not attend all 4?” and “what was the impact of SMS medication reminders on loss to follow-up?” and such. These, and other explorations, are at the heart of point #5. I believe patient-centred outcomes research holds huge promise and we should expect OpenHIE’s analytics engine to be able to support it.
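As a rough sketch of why the ANC question needs patient-level data, the function below groups records by obfuscated patient ID, counts ANC visits per woman, and then tallies delivery outcomes separately for those who attended all 4 visits and those who did not. The record layout, the event type names, the outcome labels and the toy records are all invented for illustration. The numbers mean nothing clinically, and a real analysis would need proper statistical treatment.

```python
from collections import defaultdict


def delivery_outcomes_by_anc_attendance(records, required_visits=4):
    """Return {group: (favourable_deliveries, total_deliveries)}.

    `records` is an iterable of (pseudonym, event_type, outcome) tuples,
    where outcome is only set for DELIVERY events (hypothetical layout).
    """
    visits = defaultdict(int)
    outcomes = {}
    for pseudonym, event_type, outcome in records:
        if event_type == "ANC_VISIT":
            visits[pseudonym] += 1
        elif event_type == "DELIVERY":
            outcomes[pseudonym] = outcome

    tallies = {"attended_all": [0, 0], "attended_fewer": [0, 0]}
    for pseudonym, outcome in outcomes.items():
        attended = visits[pseudonym] >= required_visits
        group = "attended_all" if attended else "attended_fewer"
        tallies[group][1] += 1
        if outcome == "favourable":
            tallies[group][0] += 1
    return {group: tuple(pair) for group, pair in tallies.items()}


# Toy records, made up purely to exercise the function.
RECORDS = (
    [("p-001", "ANC_VISIT", None)] * 4 + [("p-001", "DELIVERY", "favourable")]
    + [("p-002", "ANC_VISIT", None)] * 4 + [("p-002", "DELIVERY", "adverse")]
    + [("p-003", "ANC_VISIT", None)] * 2 + [("p-003", "DELIVERY", "favourable")]
    + [("p-004", "DELIVERY", "adverse")]
)
```

None of this is possible from a monthly “number of ANC visits” aggregate, because the link between a given woman’s visits and her delivery is exactly what aggregation throws away.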

As Bob has correctly pointed out, DHIS2 is used in all sorts of contexts. OpenHIE, however, does not need to pretend it is a “lowest common denominator”; it is a data-rich and architecturally precise context. Within the context of OpenHIE, why would we even consider pre-aggregating content on a periodic basis when we could, instead, give DHIS2 the phenomenally more valuable atomic data to chew on?

Just a thought…

Derek.

PS: There was quite a bit of talk in the previous thread about facilities and the need to manage them in hierarchies. I must admit – I think this issue is being over-thought. The key, in my view, is to unambiguously identify the facility. Period. Whichever hierarchies this facility belongs to are a completely separate and pretty much arbitrary construct which can be managed in tandem. A facility may operate within a management hierarchy and within a geographic hierarchy and be a member of multiple arbitrary subgroups (the group of maternal care facilities; the group of FBO-operated facilities; the group of facilities connected to the HIE; the group of facilities participating in a particular RCT-based pilot). These attributes of a facility, some of which are purely to support arbitrary reporting roll-ups, can be fluid over time and should be managed as such. We’re far better off, in my opinion, to approach this as a relational problem rather than as a hierarchical one. My $0.02…
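One way to picture the relational approach is a facility table keyed by a single unambiguous ID, a table of groups of any kind, and a membership table with validity dates linking the two. The sketch below uses an in-memory SQLite database. The table names, column names, group kinds and sample rows are all assumptions made for illustration, not a proposed registry schema.

```python
import sqlite3

SCHEMA = """
CREATE TABLE facility (
    facility_id TEXT PRIMARY KEY,   -- the one unambiguous identifier
    name        TEXT NOT NULL
);
CREATE TABLE facility_group (
    group_id TEXT PRIMARY KEY,
    kind     TEXT NOT NULL          -- e.g. management, geographic, programme, pilot
);
CREATE TABLE membership (
    facility_id TEXT NOT NULL REFERENCES facility(facility_id),
    group_id    TEXT NOT NULL REFERENCES facility_group(group_id),
    valid_from  TEXT NOT NULL,      -- ISO dates, so string comparison orders correctly
    valid_to    TEXT                -- NULL means still a member
);
"""


def build_demo_db():
    """Create an in-memory database with made-up sample rows."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    db.execute("INSERT INTO facility VALUES ('FAC-A', 'Example Health Centre')")
    db.executemany(
        "INSERT INTO facility_group VALUES (?, ?)",
        [("district-north", "management"),
         ("maternal-care", "programme"),
         ("rct-pilot", "pilot")],
    )
    db.executemany(
        "INSERT INTO membership VALUES (?, ?, ?, ?)",
        [("FAC-A", "district-north", "2010-01-01", None),
         ("FAC-A", "maternal-care", "2012-06-01", None),
         ("FAC-A", "rct-pilot", "2014-01-01", "2014-07-01")],
    )
    return db


def groups_on(db, facility_id, day):
    """List the groups a facility belonged to on a given ISO date."""
    rows = db.execute(
        "SELECT group_id FROM membership "
        "WHERE facility_id = ? AND valid_from <= ? "
        "AND (valid_to IS NULL OR valid_to > ?) "
        "ORDER BY group_id",
        (facility_id, day, day),
    )
    return [row[0] for row in rows]
```

With this layout a facility can sit in any number of overlapping groupings at once, and a membership that ends (the pilot, in the sample rows) is just a closed date range rather than a restructuring of a tree.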

Justin_Fyfe · August 25, 2014, 2:44pm

Hi Everyone,

I just joined the group so I apologize if this has already been discussed. Just wanted to add that IHE published the De-Identification whitepaper, which provides great guidance on when/how to de-identify data as well as some of the caveats/considerations (http://ihe.net/uploadedFiles/Documents/ITI/IHE_ITI_Handbook_De-Identification_Rev1.1_2014-06-06.pdf). There is also a good spreadsheet of algorithms and their applicability to types of data (http://ihe.net/uploadedFiles/Documents/ITI/IHE_ITI_Handbook_De-Identification-Mapping_Rev1.1_2014-06-06.xlsx).

We used a draft of these documents as a basis to develop a de-identified data warehouse (discrete data) which could produce either aggregate data or discrete data for a research study into stroke recovery at McMaster University.

Cheers

-Justin
