meet about configuring OpenEMPI

This is from Odysseas regarding more specific configuration of probabilistic vs deterministic matching algorithms in OpenEMPI. Take a few minutes to consume this sometime before our next call, and perhaps we can toy with it once we have the sandboxes all talking to each other.

Jeremy Keiper
OpenMRS Core Developer
AMPATH / IU-Kenya Support


---------- Forwarded message ----------
From: Odysseas Pentakalos, Ph.D. odysseas@sysnetint.com

Date: Fri, Jun 14, 2013 at 9:47 AM
Subject: Re: meet about configuring OpenEMPI
To: Jeremy Keiper jeremy@openmrs.org
Cc: wentong@sysnetint.com

Hi Jeremy,

  Here is an overview of the installation/configuration process for switching from the exact/deterministic matching algorithm to the probabilistic one.

  1. The first step is to reconfigure your instance to load the probabilistic module instead of the exact one during start-up. There are two things that need to be done.

  a. First, you need to tell the system to load the probabilistic module. You do that by editing the openempi-extensions-contexts.properties file in the conf directory. The contents of that file are shown below. Notice that all you do is add the context for the probabilistic algorithm to the comma-separated list of extension contexts and remove or comment out the exact-matching one.

    #openempi.extension.contexts=applicationContext-module-basic-blocking.xml,applicationContext-module-file-loader.xml,applicationContext-module-exact-matching.xml
    openempi.extension.contexts=applicationContext-module-basic-blocking.xml,applicationContext-module-file-loader.xml,applicationContext-module-probabilistic-matching.xml,applicationContext-module-basic-blocking-hp.xml
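A quick way to double-check the edit above before restarting is to read the active property line and list which matching modules it names. The following is only a minimal sketch: the helper names are made up for illustration, and the parser handles just the simple `key=value` and `#` comment forms shown in this email, not the full Java properties syntax.

```python
PROPERTY_KEY = "openempi.extension.contexts"


def parse_extension_contexts(text):
    """Return the context files listed on the active (uncommented) property line."""
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and commented-out settings
        key, sep, value = line.partition("=")
        if sep and key.strip() == PROPERTY_KEY:
            return [item.strip() for item in value.split(",") if item.strip()]
    return []


def matching_modules(contexts):
    """Pick out the entries that look like matching-algorithm contexts."""
    return [c for c in contexts if c.endswith("-matching.xml")]


if __name__ == "__main__":
    # Hypothetical location; adjust to your own conf directory.
    with open("conf/openempi-extensions-contexts.properties") as handle:
        active = parse_extension_contexts(handle.read())
    print(matching_modules(active))
```

If the list that comes back contains anything other than the single probabilistic-matching context, the properties file probably still needs attention.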

  b. Next, you need to adjust the configuration of OpenEMPI in mpi-config.xml to provide settings for the probabilistic matching algorithm. The easiest way to do this is to take one of the sample mpi-config*.xml files in the conf directory that includes the string probabilistic in its name and save it in place of the current mpi-config.xml file. You should probably back up the current functioning version of the mpi-config.xml file first so that you can get back to it if things go wrong. Before you start your server, take a look at the settings of the probabilistic matching algorithm in the mpi-config.xml file you used.
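The backup-then-replace step described above could be scripted roughly as follows. This is a sketch under assumptions: the function names and the `.bak` suffix are made up for illustration, and the sample file name is whatever `probabilistic_samples` finds in your own conf directory.

```python
import shutil
from pathlib import Path


def probabilistic_samples(conf_dir):
    """List sample configs whose file name contains the string 'probabilistic'."""
    return sorted(
        p.name for p in Path(conf_dir).glob("mpi-config*.xml")
        if "probabilistic" in p.name
    )


def switch_config(conf_dir, sample_name):
    """Back up the current mpi-config.xml, then copy the chosen sample over it."""
    conf = Path(conf_dir)
    current = conf / "mpi-config.xml"
    sample = conf / sample_name
    if not sample.is_file():
        raise FileNotFoundError(sample)
    backup = conf / "mpi-config.xml.bak"  # hypothetical backup name
    if current.exists():
        shutil.copy2(current, backup)  # keep the working version recoverable
    shutil.copy2(sample, current)
    return backup
```

Restoring is then just a matter of copying the backup file over mpi-config.xml again.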

  Make sure that the namespaces in the XML file include the one for the probabilistic matching algorithm. The sample configuration files we supply are all valid, so you should not have to do anything here, but I am mentioning it because it is very important. The base loader of the configuration file uses the namespaces to figure out which module-specific loader it should use to load portions of the overall mpi-config.xml file. Here is an example that includes the probabilistic matching algorithm.

    <mpi-config
        xsi:schemaLocation="http://configuration.openempi.openhie.org/mpiconfig mpi-config.xsd
            http://configuration.openempi.openhie.org/file-loader file-loader.xsd
            http://configuration.openempi.openhie.org/basic-blocking basic-blocking.xsd
            http://configuration.openempi.openhie.org/probabilistic-matching probabilistic-matching.xsd"
        xmlns="http://configuration.openempi.openhie.org/mpiconfig"
        xmlns:mpi="http://configuration.openempi.openhie.org/mpiconfig"
        xmlns:fl="http://configuration.openempi.openhie.org/file-loader"
        xmlns:bb="http://configuration.openempi.openhie.org/basic-blocking"
        xmlns:pm="http://configuration.openempi.openhie.org/probabilistic-matching"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
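Since the namespace declarations decide which module-specific loader gets involved, it can be worth confirming that the probabilistic-matching namespace really is declared in the file you copied into place. The sketch below uses the `start-ns` events of Python's standard `xml.etree.ElementTree.iterparse`. The function names are hypothetical, and the check says nothing about whether OpenEMPI itself will accept the file.

```python
import io
import xml.etree.ElementTree as ET

PM_NAMESPACE = "http://configuration.openempi.openhie.org/probabilistic-matching"


def declared_namespaces(xml_source):
    """Map prefix -> URI for every namespace declaration in the document.

    xml_source may be a file name or a binary file object. The default
    namespace is reported under the empty-string prefix.
    """
    return {
        prefix: uri
        for _event, (prefix, uri) in ET.iterparse(xml_source, events=("start-ns",))
    }


def has_probabilistic_namespace(xml_source):
    """True if any prefix is bound to the probabilistic-matching namespace."""
    return PM_NAMESPACE in declared_namespaces(xml_source).values()


if __name__ == "__main__":
    # Hypothetical path; point this at your own conf directory.
    print(has_probabilistic_namespace("conf/mpi-config.xml"))
```

Note that `iterparse` reads the whole document, so a malformed file raises a parse error rather than returning False.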

  c. Next, check the section that configures the matching algorithm. The pm namespace tells the base loader that the module-specific loader for the probabilistic matching algorithm knows how to load that portion of the file, so the base loader delegates parsing of this portion of mpi-config.xml to that module. The settings below should be reasonable and generic enough to work, but depending on the data in your database you may need to adjust them. You also need to make sure that you have a reasonable number of records in the system (on the order of thousands); otherwise the estimated probability distributions will give you unusual results.

    <pm:probabilistic-matching>
        <pm:false-negative-probability>0.1</pm:false-negative-probability>
        <pm:false-positive-probability>0.9</pm:false-positive-probability>
        <pm:match-fields>
            <pm:match-field>
                <pm:field-name>givenName</pm:field-name>
                <pm:agreement-probability>0.9</pm:agreement-probability>
                <pm:disagreement-probability>0.1</pm:disagreement-probability>
                <pm:comparator-function>
                    JaroWinkler
                </pm:comparator-function>
                <pm:match-threshold>0.85</pm:match-threshold>
            </pm:match-field>
            <pm:match-field>
                <pm:field-name>familyName</pm:field-name>
                <pm:agreement-probability>0.9</pm:agreement-probability>
                <pm:disagreement-probability>0.1</pm:disagreement-probability>
                <pm:comparator-function>
                    Exact
                </pm:comparator-function>
                <pm:match-threshold>0.85</pm:match-threshold>
            </pm:match-field>
        </pm:match-fields>
        <pm:config-file-directory>/home/odysseas/projects/openempi-development-2.1.2/openempi/conf</pm:config-file-directory>
    </pm:probabilistic-matching>
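To get a feel for what the per-field numbers above mean, here is a rough sketch of the textbook Fellegi-Sunter weight calculation. It rests on assumptions not stated in this email: that pm:agreement-probability plays the role of m (the chance the field agrees for a true match), that pm:disagreement-probability plays the role of u (the chance the field agrees for a non-match), and that a comparator score at or above pm:match-threshold counts as agreement. OpenEMPI's actual implementation may well differ, and the function names are made up.

```python
import math


def field_weights(m, u):
    """Textbook Fellegi-Sunter log2 weights for one field.

    m: assumed P(field agrees | records are a true match)
    u: assumed P(field agrees | records are not a match)
    """
    agree = math.log2(m / u)
    disagree = math.log2((1 - m) / (1 - u))
    return agree, disagree


def pair_score(comparator_scores, fields):
    """Sum the agreement or disagreement weight of each configured field."""
    total = 0.0
    for name, cfg in fields.items():
        agree_w, disagree_w = field_weights(cfg["m"], cfg["u"])
        # Assumption: a comparator score at or above the threshold is agreement.
        agrees = comparator_scores[name] >= cfg["threshold"]
        total += agree_w if agrees else disagree_w
    return total


# Values copied from the two match-field entries in the configuration.
FIELDS = {
    "givenName": {"m": 0.9, "u": 0.1, "threshold": 0.85},
    "familyName": {"m": 0.9, "u": 0.1, "threshold": 0.85},
}

if __name__ == "__main__":
    print(pair_score({"givenName": 0.92, "familyName": 1.0}, FIELDS))
    print(pair_score({"givenName": 0.50, "familyName": 1.0}, FIELDS))
```

Under these assumptions, agreement on a field adds log2(9), about 3.17, to the score and disagreement subtracts the same amount, so with symmetric settings like these one agreeing field and one disagreeing field cancel out.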

  2. You should now be able to start the system. At start-up the probabilistic matching algorithm looks for a file called FellegiSunterParameters.ser (or something like that) that contains the statistical model generated from the settings above. This file does not exist when you first start the system after installation, so it will need to be generated. Thus, the server may take a few minutes to start the first time; the amount of time needed will vary depending on the settings of the blocking algorithm and the number of records in your database.
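As a generic illustration of why blocking settings and record count drive that start-up time, the sketch below counts how many record pairs would need comparing with and without blocking. It is not OpenEMPI code; the record layout and the choice of familyName as the blocking key are assumptions made for the example.

```python
from collections import Counter


def pairs_without_blocking(n):
    """Every record compared with every other record: n choose 2."""
    return n * (n - 1) // 2


def pairs_with_blocking(records, block_key):
    """Only records sharing the same blocking key value are compared."""
    sizes = Counter(block_key(record) for record in records)
    return sum(size * (size - 1) // 2 for size in sizes.values())


if __name__ == "__main__":
    # Hypothetical records; only the blocking field matters here.
    records = [
        {"familyName": "Smith"},
        {"familyName": "Smith"},
        {"familyName": "Jones"},
        {"familyName": "Smith"},
    ]
    print(pairs_without_blocking(len(records)))
    print(pairs_with_blocking(records, lambda r: r["familyName"]))
```

The unblocked count grows quadratically with the number of records, while the blocked count depends on how evenly the blocking key splits the data, which is why a coarse blocking key on a large database can make the first start noticeably slower.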

  3. Once everything is up and running, you should be able to go to the Administrative console, modify the settings of the matching algorithm, and make adjustments if needed.

  If you do change the settings of the probabilistic matching algorithm, you will need to relink all the records in the database using the new settings. You do that by selecting the Link all records option from the advanced menu.

  If you have any issues, Wentong or I will be able to help you while I am away. I may not have Internet access on a daily basis, but I will be checking my email fairly regularly.

  I am going to clean up these notes and put them up on the OpenEMPI documentation Wiki.

  Best regards,

  Odysseas

  Odysseas Pentakalos, Ph.D., PMP
Chief Technology Officer
SYSNET International, Inc.
2930 Oak Shadow Drive
Oak Hill, Virginia 20171
mailto:odysseas@sysnetint.com
(703) 855-2029