This is from Odysseas regarding more specific configuration of probabilistic vs deterministic matching algorithms in OpenEMPI. Take a few minutes to consume this sometime before our next call, and perhaps we can toy with it once we have the sandboxes all talking to each other.
Jeremy Keiper
OpenMRS Core Developer
AMPATH / IU-Kenya Support
---------- Forwarded message ----------
From: Odysseas Pentakalos, Ph.D. <odysseas@sysnetint.com>
Date: Fri, Jun 14, 2013 at 9:47 AM
Subject: Re: meet about configuring OpenEMPI
To: Jeremy Keiper <jeremy@openmrs.org>
Cc: wentong@sysnetint.com
Hi Jeremy,
Here is an overview of the installation/configuration process for
switching from the exact/deterministic matching algorithm to the
probabilistic one.
1. The first step is to reconfigure your instance to load the
probabilistic module instead of the exact one during start-up.
There are two things that need to be done.
a. First, you need to tell the system to load the probabilistic
module. You do that by editing the
openempi-extensions-contexts.properties file in the conf directory.
The contents of that file are shown below. Notice that all you do is
add the context for the probabilistic algorithm to the
comma-separated list of extension contexts and remove or comment out
the exact-matching one.
#openempi.extension.contexts=applicationContext-module-basic-blocking.xml,applicationContext-module-file-loader.xml,applicationContext-module-exact-matching.xml
openempi.extension.contexts=applicationContext-module-basic-blocking.xml,applicationContext-module-file-loader.xml,applicationContext-module-probabilistic-matching.xml,applicationContext-module-basic-blocking-hp.xml
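For anyone who would rather script this edit than make it by hand, here is a minimal sketch in Python. It is not part of OpenEMPI; the function name is hypothetical, and it assumes the property sits on a single line in the file. It only swaps the exact-matching context for the probabilistic one, so any other context you want (such as the basic-blocking-hp one in the example above) still has to be added by you.

```python
# Hypothetical helper, not shipped with OpenEMPI.
KEY = "openempi.extension.contexts"
EXACT = "applicationContext-module-exact-matching.xml"
PROBABILISTIC = "applicationContext-module-probabilistic-matching.xml"


def switch_to_probabilistic(properties_text):
    """Return the properties text with the exact-matching context
    replaced by the probabilistic-matching one.

    Comment lines and unrelated properties are passed through untouched.
    """
    out = []
    for line in properties_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#") or not stripped.startswith(KEY + "="):
            out.append(line)
            continue
        value = stripped.split("=", 1)[1]
        contexts = [c.strip() for c in value.split(",") if c.strip()]
        # Drop the exact-matching context, then add the probabilistic one once.
        contexts = [c for c in contexts if c != EXACT]
        if PROBABILISTIC not in contexts:
            contexts.append(PROBABILISTIC)
        out.append(KEY + "=" + ",".join(contexts))
    return "\n".join(out) + "\n"
```

You would read openempi-extensions-contexts.properties into a string, pass it through the function, and write the result back after keeping a copy of the original.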
b. Next, you need to adjust the configuration of OpenEMPI in
mpi-config.xml to provide settings for the probabilistic matching
algorithm. The easiest way to do this is to take one of the sample
mpi-config*.xml files in the conf directory that includes the string
probabilistic in its name and save it in place of the current
mpi-config.xml file. You should probably back up the current
functioning version of mpi-config.xml first so that you can get back
to it if things go wrong. Before you start your server, take a look
at the settings of the probabilistic matching algorithm in the
mpi-config.xml file you used. Make sure that the namespaces in the
XML file include the one for the probabilistic matching algorithm.
The sample configuration files we supply are all valid, so you should
not have to do anything here, but I am mentioning it because it is
very important. The base loader of the configuration file uses the
namespaces to figure out which module-specific loader it should use
to load portions of the overall mpi-config.xml file. Here is an
example that includes the probabilistic matching algorithm.
<mpi-config
    xsi:schemaLocation="http://configuration.openempi.openhie.org/mpiconfig mpi-config.xsd
        http://configuration.openempi.openhie.org/file-loader file-loader.xsd
        http://configuration.openempi.openhie.org/basic-blocking basic-blocking.xsd
        http://configuration.openempi.openhie.org/probabilistic-matching probabilistic-matching.xsd"
    xmlns="http://configuration.openempi.openhie.org/mpiconfig"
    xmlns:mpi="http://configuration.openempi.openhie.org/mpiconfig"
    xmlns:fl="http://configuration.openempi.openhie.org/file-loader"
    xmlns:bb="http://configuration.openempi.openhie.org/basic-blocking"
    xmlns:pm="http://configuration.openempi.openhie.org/probabilistic-matching"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
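If you want to confirm the namespace declaration before starting the server, a small check along these lines may help. This is only a sketch using Python's standard library, not an OpenEMPI tool; the function names are hypothetical, and the check only confirms that the namespace URI is declared somewhere in a well-formed file, not that OpenEMPI will accept the configuration.

```python
import io
import xml.etree.ElementTree as ET

# Namespace URI taken from the example root element above.
PM_NS = "http://configuration.openempi.openhie.org/probabilistic-matching"


def declared_namespaces(xml_source):
    """Map each declared prefix to its URI ('' is the default namespace).

    `xml_source` may be a file name or a binary file object.
    """
    return {
        prefix: uri
        for _event, (prefix, uri) in ET.iterparse(xml_source, events=("start-ns",))
    }


def has_probabilistic_namespace(xml_source):
    """True if the probabilistic-matching namespace is declared in the file."""
    return PM_NS in declared_namespaces(xml_source).values()


if __name__ == "__main__":
    # Tiny in-memory stand-in for mpi-config.xml, for illustration only.
    sample = (
        b'<mpi-config xmlns="http://configuration.openempi.openhie.org/mpiconfig" '
        b'xmlns:pm="' + PM_NS.encode() + b'">'
        b"<pm:probabilistic-matching/></mpi-config>"
    )
    print(has_probabilistic_namespace(io.BytesIO(sample)))
```

In practice you would pass the path to your own mpi-config.xml instead of the in-memory sample.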
c. Next, check the section that configures the matching algorithm.
The pm namespace tells the base loader that the module-specific
loader for the probabilistic matching algorithm knows how to load
that portion of the file, so the base loader delegates parsing of
this portion of mpi-config.xml to that module. The settings below
should be reasonable and generic enough to work, but depending on the
data in your database you may need to adjust them. You also need to
make sure that you have a reasonable number of records in the system
(on the order of thousands); otherwise the estimated probability
distributions will give you unusual results.
<pm:probabilistic-matching>
    <pm:false-negative-probability>0.1</pm:false-negative-probability>
    <pm:false-positive-probability>0.9</pm:false-positive-probability>
    <pm:match-fields>
        <pm:match-field>
            <pm:field-name>givenName</pm:field-name>
            <pm:agreement-probability>0.9</pm:agreement-probability>
            <pm:disagreement-probability>0.1</pm:disagreement-probability>
            <pm:comparator-function>JaroWinkler</pm:comparator-function>
            <pm:match-threshold>0.85</pm:match-threshold>
        </pm:match-field>
        <pm:match-field>
            <pm:field-name>familyName</pm:field-name>
            <pm:agreement-probability>0.9</pm:agreement-probability>
            <pm:disagreement-probability>0.1</pm:disagreement-probability>
            <pm:comparator-function>Exact</pm:comparator-function>
            <pm:match-threshold>0.85</pm:match-threshold>
        </pm:match-field>
    </pm:match-fields>
    <pm:config-file-directory>/home/odysseas/projects/openempi-development-2.1.2/openempi/conf</pm:config-file-directory>
</pm:probabilistic-matching>
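To get a feel for what the per-field probabilities do, here is a rough sketch of the textbook Fellegi-Sunter weight calculation in Python. It rests on assumptions that these notes do not confirm: that agreement-probability plays the role of the m-probability (the field agrees given a true match), that disagreement-probability plays the role of the u-probability (the field agrees given a non-match), and that a field counts as agreeing when its comparator score reaches match-threshold. The function names are hypothetical, and this is not OpenEMPI's actual code.

```python
import math

# Values copied from the sample configuration above; the m/u reading of
# agreement-probability and disagreement-probability is an assumption.
FIELDS = {
    "givenName": {"m": 0.9, "u": 0.1, "threshold": 0.85},
    "familyName": {"m": 0.9, "u": 0.1, "threshold": 0.85},
}


def field_weights(m, u):
    """Return (agreement_weight, disagreement_weight) in bits."""
    return math.log2(m / u), math.log2((1 - m) / (1 - u))


def pair_score(similarities, fields=FIELDS):
    """Sum the per-field weights for one candidate record pair.

    `similarities` maps each field name to a comparator score in [0, 1].
    A field is treated as agreeing when its score reaches the threshold
    (an assumption about how match-threshold is applied).
    """
    total = 0.0
    for name, cfg in fields.items():
        agree, disagree = field_weights(cfg["m"], cfg["u"])
        total += agree if similarities[name] >= cfg["threshold"] else disagree
    return total


if __name__ == "__main__":
    print(field_weights(0.9, 0.1))
    print(pair_score({"givenName": 0.95, "familyName": 1.0}))
```

Under these assumptions, each agreeing field adds log2(0.9/0.1), about 3.17 bits, to the pair's score and each disagreeing field subtracts the same amount, so moving the two probabilities closer together makes a field count for less.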
2. You should now be able to start the system. At start-up the
probabilistic matching algorithm looks for a file called
FellegiSunterParameters.ser (or something like that) that contains
the statistical model generated from the settings above. This file
does not exist when you first start the system after installation, so
the system will need to generate it. Thus, the server may take a few
minutes to start the first time; the amount of time needed will vary
with the settings of the blocking algorithm and the number of records
in your database.
3. Once everything is up and running, you should be able to go to
the Administrative console, modify the settings of the matching
algorithm and make adjustments if needed.
If you do change the settings of the probabilistic matching
algorithm, you will need to relink all the records in the database
using these settings. You do that by selecting
the Link all records option from the advanced menu.
If you have any issues, Wentong or I will be able to help you while
I am away. I may not have Internet access on a daily basis, but I
will be checking my email fairly regularly.
I am going to clean up these notes and put them up on the OpenEMPI
documentation wiki.
Best regards,
Odysseas
Odysseas Pentakalos, Ph.D., PMP
Chief Technology Officer
SYSNET International, Inc.
2930 Oak Shadow Drive
Oak Hill, Virginia 20171
odysseas@sysnetint.com
(703) 855-2029