Data normalization techniques and autocoding algorithms for the medical dictionary for regulatory activities (MedDRA)

Tucker, Mike Don

Autocoding systems used for Medical Dictionary for Regulatory Activities (MedDRA) coding typically employ verbatim matching methods that attempt to electronically assign a MedDRA dictionary term to a raw adverse event term based upon explicit spellings. Not all raw terms can be matched directly to dictionary terms, which generally leaves a sizeable portion of terms to be coded manually. This paper presents data normalization techniques and autocoding algorithms based upon orthographic properties that are intended to augment the basic linking procedures used by many electronic autocoding systems. Although MedDRA is highlighted in this paper, the normalization techniques and autocoding algorithms presented are suitable for other medical terminology repositories such as Coding Symbols for a Thesaurus of Adverse Reaction Terms (COSTART), World Health Organisation Adverse Reaction Terminology (WHOART), and International Classification of Diseases (ICD-9), as well as in-house dictionaries developed by various pharmaceutical companies. Autocoding systems that employ data normalization and additional matching algorithms may realize a higher electronic match rate, which directly reduces manual review time. Data from multiple clinical studies along with sample code, examples, and results are provided.

Key Words: Autocoding; Data normalization; Dictionary; MedDRA; Verbatim


THE RECENT ARRIVAL OF MedDRA with the backing of the International Conference on Harmonisation (ICH) has caused many pharmaceutical/biotechnology companies and supportive organizations to readdress their coding processes and systems. MedDRA is expected to be the standard coding resource for regulated medical terminology within ICH regions (1). For many organizations, this transition to a standard coding dictionary necessitates a process change (2). Changing procedures often impacts the technology employed to support the process, and provides an opportune time to advance electronic autocoding capabilities.

MedDRA itself is stratified across a five-level hierarchy ranging from maximum specificity found in Lowest Level Terms (LLTs) to the broadest grouping found in System Organ Classes (SOCs). Multiple terms may exist for each upward level in the hierarchy. For example, Preferred Terms (PTs) may have multiple LLTs and High Level Terms (HLTs) may have multiple PTs. This hierarchy is illustrated in Figure 1.

MedDRA accommodates both linear and multiaxial coding. Linear or “primary” coding is best described as the simple one-to-one relationship that runs from LLTs to PTs and on up to SOCs. Primary coding essentially defaults to the primary SOC for any given PT. Multiaxial or “secondary” coding describes the choices for assigning PTs to one of several secondary SOCs, if available (3).

The LLT itself constitutes the lowest level of the terminology. The LLT is used to characterize synonyms, lexical variants, and quasi-synonyms respective to PTs. Each LLT is linked to only one PT (4). For simplicity’s sake, this paper highlights simple coding of raw terms to LLTs.

The coding of patient data is critical to grouping, analysis, and reporting data. Coding directly impacts submissions for New Drug Applications (NDAs), safety surveillance, and product labeling. The process of coding itself benefits from electronic automation, which entails the use of computer programs to facilitate and expedite the matching of raw field terms with dictionary terms (5). For much of the pharmaceutical industry, these raw terms are referred to as Case Report Form (CRF) terms (although adverse events and medications can certainly come from other sources such as safety surveillance forms, investigator source documents, etc.).

The most widely used automated matching method is called the verbatim match, which entails electronically comparing the CRF term to the LLT term and determining if they are completely identical. For example, a CRF term HEADACHE would be a verbatim match to the MedDRA LLT HEADACHE. This exact match is the most accurate form of electronically identifying CRF terms with dictionary terms.

Some autocoding systems use a synonym repository to augment verbatim-like matches. Basically, a synonym repository is an electronic file that links previously encountered CRF terms to dictionary terms. In other words, this synonym file holds lexical links between previously coded CRF terms and MedDRA LLTs. A CRF term ACHING HEAD would not be a verbatim match to HEADACHE although it is clinically similar. However, it is still possible to attain an electronic autocoding match with ACHING HEAD via a precompiled synonym repository. Once a new nonverbatim term is encountered in the field and manually coded, it can be stored in a synonym file for later retrieval and electronic matching. In the synonym file, ACHING HEAD would be linked to the MedDRA LLT HEADACHE, so the term would code automatically the next time it is encountered. The combination of verbatim matches and subsequent synonym matches is a common coding scheme for many autocoding systems.
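The verbatim-then-synonym scheme described above can be sketched in a few lines. This is a minimal Python illustration rather than code from any particular autocoding system; the names `MEDDRA_LLTS`, `SYNONYMS`, `autocode`, and `store_synonym` are hypothetical, and the dictionary and synonym file are assumed to fit in memory.

```python
# Hypothetical in-memory stand-ins for the dictionary and the synonym file.
MEDDRA_LLTS = {"HEADACHE", "SINUS HEADACHE"}
SYNONYMS = {"ACHING HEAD": "HEADACHE"}  # previously coded CRF term -> LLT


def autocode(crf_term):
    """Return (llt, match_type), trying a verbatim match before a synonym match."""
    if crf_term in MEDDRA_LLTS:
        return crf_term, "verbatim"
    if crf_term in SYNONYMS:
        return SYNONYMS[crf_term], "synonym"
    return None, "none"


def store_synonym(crf_term, llt):
    """Record a manually coded term so that it autocodes on its next encounter."""
    SYNONYMS[crf_term] = llt


print(autocode("HEADACHE"))     # ('HEADACHE', 'verbatim')
print(autocode("ACHING HEAD"))  # ('HEADACHE', 'synonym')
```

A term that falls through both lookups would go to a coder, whose decision can then be passed to `store_synonym`.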


It is of considerable value to increase the electronic match rate in autocoding systems since this will typically reduce manual review time for designated coding personnel. Before examining additional matching algorithms, there are steps that can be taken to normalize the CRF terms and MedDRA LLTs to increase match rates.

Choice of case plays a critical role in making verbatim and synonym matches. Some systems do not recognize a CRF term of HEADACHE as matching the MedDRA LLT “Headache” because the letter case is different in spite of the words being spelled the same. In order to avoid case issues, converting all terms to uppercase before executing the matching procedure is recommended. In SAS, this is accomplished with the UPCASE() function (6).
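As a rough Python analogue of the uppercase step (a sketch, not the SAS code itself), both sides of the comparison can be passed through the same case conversion. The function name `normalize_case` is hypothetical.

```python
def normalize_case(term):
    """Convert a term to uppercase so that case differences cannot block a match."""
    return term.upper()


crf_term = "HEADACHE"
llt = "Headache"
print(crf_term == llt)                                  # False
print(normalize_case(crf_term) == normalize_case(llt))  # True
```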


Trailing and leading blanks, along with multiple inserted blanks in CRF terms, can thwart legitimate electronic hits too. For example, suppose the CRF term SINUS HEADACHE has two blank characters inserted between SINUS and HEADACHE along with three trailing blanks. This will typically cause an electronic mismatch with the MedDRA LLT of SINUS HEADACHE. To safely correct this problem, removing extraneous blanks is recommended. In SAS this is accomplished with the COMPBL(), TRIM(), and LEFT() functions (6). COMPBL() will compress all multiple adjacent blanks in the term to just one blank. LEFT() will shift the term to the farthest left position in the field, thus pushing leading blanks to the end. TRIM() will remove all trailing blanks. All three functions can be used in conjunction to remove extraneous blanks and reduce the raw CRF term to SINUS HEADACHE.
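The combined effect of the three blank-handling steps can be approximated in Python as follows. This is a sketch under the assumption that only ordinary space characters are involved; the function name `remove_extra_blanks` is hypothetical.

```python
import re


def remove_extra_blanks(term):
    """Collapse runs of blanks to one blank, then drop leading and trailing blanks."""
    collapsed = re.sub(r" {2,}", " ", term)  # comparable to COMPBL()
    return collapsed.strip(" ")              # comparable to TRIM(LEFT())


raw = "SINUS  HEADACHE   "  # two inner blanks, three trailing blanks
print(repr(remove_extra_blanks(raw)))  # 'SINUS HEADACHE'
```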


I believe that changing case and/or removing extraneous blanks does not impact clinical content for coding purposes. However, the next normalization technique can potentially alter clinical content; a match that depends on it should not be treated as a verbatim match but should be categorized as a possible match.

Thus, both the CRF term and the LLT term will appear as BLACK OUT NOT AMNESIA, which provides an electronic match. Again, it should be understood that changing punctuation could potentially alter the clinical content of adverse events or medications, particularly where numbers and decimals are involved. Consequently, changes of this nature should only be considered possible matches and flagged or trapped for further manual review.
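A minimal Python sketch of punctuation normalization follows. It assumes that punctuation is replaced by a blank and that the resulting blanks are then squeezed, which is one way to arrive at BLACK OUT NOT AMNESIA; the input spellings, the ASCII-only character class, and the function name `strip_punctuation` are assumptions made for illustration.

```python
import re


def strip_punctuation(term):
    """Uppercase the term, replace non-alphanumeric characters with blanks, squeeze blanks."""
    no_punct = re.sub(r"[^A-Z0-9 ]", " ", term.upper())
    return " ".join(no_punct.split())


# Hypothetical raw spellings that differ only in punctuation.
print(strip_punctuation("BLACK-OUT (NOT AMNESIA)"))  # BLACK OUT NOT AMNESIA
print(strip_punctuation("BLACK OUT, NOT AMNESIA"))   # BLACK OUT NOT AMNESIA

# Decimals show why such hits are only "possible" matches.
print(strip_punctuation("2.5 MG"))                   # 2 5 MG
```

The last line shows how a dose such as 2.5 loses its decimal point, which is the kind of change that warrants manual review.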

In relation to verbatim, synonym, or possible matches, flagging or displaying hits in autocoding systems according to the type of match acquired is recommended. A simple scheme would be to create a status data field in the autocoding system that traps the type of match made at the time of the electronic linking. A one-character status field could house “V” for verbatim, “S” for synonym, “P” for possible, and “N” for no match, which could be used to signal further action. This approach aids in appropriately distributing matched data, that is, either prompting a designated coder for further manual review or allowing a match to stand as is, and also aids system metrics. One can execute frequencies on the status field and use these figures to assess autocoding performance.
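The status field and its frequencies can be sketched as below. The one-character codes come from the scheme just described; the function names, and the choice to route “P” and “N” to manual review, are assumptions for illustration.

```python
from collections import Counter

STATUS_LABELS = {"V": "verbatim", "S": "synonym", "P": "possible", "N": "no match"}


def needs_manual_review(status):
    """Assumed routing rule: possible matches and non-matches go to a coder."""
    return status in {"P", "N"}


def status_frequencies(statuses):
    """Return {code: (count, percent)} for every status code."""
    counts = Counter(statuses)
    total = len(statuses)
    if total == 0:
        return {code: (0, 0.0) for code in STATUS_LABELS}
    return {code: (counts[code], round(100 * counts[code] / total, 2))
            for code in STATUS_LABELS}


# Hypothetical status values trapped for five CRF terms.
print(status_frequencies(["V", "V", "S", "P", "N"]))
```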

The previous normalization techniques are not limited to only CRF term to MedDRA LLT dictionary links. These techniques can equally apply to CRF term and synonym file links. In other words, the synonym file can be treated as a kind of pseudo-dictionary itself. In fact, the following autocoding algorithms also apply to synonym file links, which can result in an increased utility of synonym repositories beyond simple exact matches.


For the Consonant Recognition Algorithm, it should be understood that the larger the term, the more precise the match becomes. Likewise, smaller terms or abbreviations are more prone to error. Omitting small terms before they pass through tokenization is recommended. Whole terms with fewer than five characters in total should be sifted from the procedure in order to maintain accuracy in links.


The Asymmetric Spelling Distance Algorithm computes the normalized cost for converting a key word to a query word via a sequence of character manipulations such as inserting, exchanging, or removing characters (6). Each specific manipulation has a respective numerical cost that is based upon the degree of difference, as displayed in Table 1. There may be multiple differences between two words, which requires a summation of the costs to arrive at a total cost. However, cumulative cost is not sufficient to fully characterize the difference between two terms. The length of the query word is important and should be factored in as well. Thus, distance is the sum of the costs divided (in integer arithmetic) by the length of the query word (6). Table 2 illustrates distance scoring based upon some sample character manipulations.

The catch to this method is determining the cut-off score for unacceptable matches. After running the SAS SPEDIS() function over millions of matching iterations, I suggest that a distance score of less than 15 provides a reasonably accurate possible match. If the cut-off score is raised, more matches are acquired, but this may come at the expense of accuracy. On the other hand, a cut-off score that is too low can omit reasonable matches and ultimately cause increased manual browsing time through a source dictionary. Thus, a distance cut-off score of 15 attempts to balance match rate and accuracy. However, this cut-off score can be adjusted slightly to suit the coding environment and its demands.
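The idea of a length-normalized, asymmetric cost can be illustrated with a simplified Python sketch. This is not a reimplementation of SPEDIS(): it recognizes only three operations, and the costs used (insert 100, delete 50, replace 100) are assumed values chosen for illustration. The function names are hypothetical.

```python
def spelling_distance(query, keyword, insert=100, delete=50, replace=100):
    """Cost of converting keyword into query, integer-divided by len(query)."""
    if not query:
        raise ValueError("query must not be empty")
    rows, cols = len(keyword) + 1, len(query) + 1
    cost = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        cost[i][0] = i * delete   # remove every keyword character
    for j in range(1, cols):
        cost[0][j] = j * insert   # add every query character
    for i in range(1, rows):
        for j in range(1, cols):
            same = keyword[i - 1] == query[j - 1]
            cost[i][j] = min(
                cost[i - 1][j - 1] + (0 if same else replace),
                cost[i - 1][j] + delete,
                cost[i][j - 1] + insert,
            )
    return cost[-1][-1] // len(query)


def is_possible_match(query, keyword, cutoff=15):
    """Accept the pair as a possible match when the score is below the cut-off."""
    return spelling_distance(query, keyword) < cutoff


print(spelling_distance("HEADACHES", "HEADACHE"))  # 11: one insert, 100 // 9
print(spelling_distance("HEADACHE", "HEADACHES"))  # 6: one delete, 50 // 8
print(is_possible_match("HEDACHE", "HEADACHE"))    # True: 50 // 7 = 7
```

The first two calls show the asymmetry: swapping the query and the key word changes both the operation needed and the length used for normalization.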


Overly verbose descriptions of medical terminology can sometimes hinder electronic coding. Medical terms from the field often contain extraneous words that do not aid in computerized links to a coding dictionary. Even so, verbose characterization of events should contain key medical words that can be used to catch automatic matches. For example, a CRF term of HERPES SIMPLEX OUTBREAK-LIP does not literally match the MedDRA LLT of HERPES SIMPLEX, but the Encapsulated Word Comparison Algorithm can be used to attain a match. This algorithm takes the entire LLT term and encapsulates it as a virtual single word. This encapsulated LLT is then compared to see if it exists wholly within the CRF term. In the above example, the encapsulated LLT [HERPES SIMPLEX] does exist within the CRF term [HERPES SIMPLEX] OUTBREAK-LIP, which provides a possible match (brackets are for emphasis only and are not necessary as part of the encapsulating syntax). This comparison of terms works inversely too. A CRF term of [MOTOR NEUROPATHY] can be encapsulated as a virtual word and found within the MedDRA LLT of GENERALISED SENSORI[MOTOR NEUROPATHY]. This too provides a suitable match.
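The two-way containment test can be sketched in Python with a plain substring check. The function name `encapsulated_match` is hypothetical, and both terms are assumed to have been normalized already.

```python
def encapsulated_match(crf_term, llt):
    """True when either whole term occurs inside the other."""
    if not crf_term or not llt:
        return False  # an empty string would otherwise "match" everything
    return llt in crf_term or crf_term in llt


print(encapsulated_match("HERPES SIMPLEX OUTBREAK-LIP", "HERPES SIMPLEX"))
# True: the LLT lies within the CRF term
print(encapsulated_match("MOTOR NEUROPATHY", "GENERALISED SENSORIMOTOR NEUROPATHY"))
# True: the CRF term lies within the LLT
```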

There are several other types of word manipulation techniques available. Inverting word order allows a term such as URINATING PAIN to link to PAIN URINATING. Eliminating words without clinical meaning, such as “to,” “for,” and “due,” can increase hit rates, and cross-linking individual words can aid in possible matches (5).
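Word inversion and noise-word elimination can both be handled by comparing an order-independent key. The Python sketch below is one possible approach, not a method taken from the cited work; `NOISE_WORDS` and `word_key` are hypothetical names, and the third input term is an invented example.

```python
NOISE_WORDS = {"TO", "FOR", "DUE"}  # words without clinical meaning


def word_key(term):
    """Drop noise words and sort the rest so that word order no longer matters."""
    words = [w for w in term.upper().split() if w not in NOISE_WORDS]
    return tuple(sorted(words))


print(word_key("URINATING PAIN") == word_key("PAIN URINATING"))         # True
print(word_key("PAIN DUE TO URINATING") == word_key("PAIN URINATING"))  # True
```

Because reordering and dropping words can change clinical meaning, hits of this kind would be flagged as possible matches.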


Using the techniques presented, I have personally observed match increases of as much as 30% over simple verbatim matches for some clinical studies. Table 3, which uses live aggregate data across multiple studies, illustrates the effectiveness of additional matching algorithms as data first encounter an autocoding system. Not all autocoding systems use synonym repositories, so Table 3 omits synonym matches to clarify the results; in addition, synonym file routines do not trigger on the first encounter of raw data. In this particular table, we see a match rate increase of 25.79% over simple verbatim matches. Overall, verbatim plus possible matching procedures produced a 91.36% match rate. Results may vary according to the complexity of the field terminology, and may increase if a synonym repository is used.


The order in which raw data flow through autocoding algorithms is critical to achieving quality matches. Trapping matches based upon the most accurate method first, then allowing nonmatched terms to flow into the next accurate algorithm, and so on is recommended. Obviously, verbatim matches would be the most accurate method, followed by matches found from a synonym file, if a synonym repository is used. Next come the possible type matches based upon the above algorithms. Having tested these algorithms across millions of sample iterations and thousands of live terms, I have concluded that the next series of matching methods would be the Consonant Recognition Algorithm, the Asymmetric Spelling Distance Algorithm, and the Encapsulated Word Comparison Algorithm. This order allows the most accurate match to occur before the data are fed into the next matching procedure. Once a match is found, the CRF term is sifted out of succeeding matching algorithms and is reported based upon the first best hit. Figure 2 displays this recommended flow.
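The first-best-hit flow can be sketched as an ordered list of matchers. In this minimal Python illustration only three matchers are wired in (verbatim, synonym, and encapsulated word comparison); the other algorithms would take their places in the list in the recommended order. All names and the sample data are hypothetical.

```python
# Hypothetical dictionary and synonym file.
LLTS = ["HEADACHE", "HERPES SIMPLEX"]
SYNONYMS = {"ACHING HEAD": "HEADACHE"}


def verbatim(term):
    return term if term in LLTS else None


def synonym(term):
    return SYNONYMS.get(term)


def encapsulated(term):
    for llt in LLTS:
        if llt in term or term in llt:
            return llt
    return None


# Most accurate method first; each entry is (status code, matcher).
MATCHERS = [("V", verbatim), ("S", synonym), ("P", encapsulated)]


def cascade(crf_term, matchers=MATCHERS):
    """Return (llt, status) from the first matcher that hits, else (None, 'N')."""
    for status, matcher in matchers:
        llt = matcher(crf_term)
        if llt is not None:
            return llt, status  # the term is sifted out of later algorithms
    return None, "N"


print(cascade("HEADACHE"))                     # ('HEADACHE', 'V')
print(cascade("ACHING HEAD"))                  # ('HEADACHE', 'S')
print(cascade("HERPES SIMPLEX OUTBREAK-LIP"))  # ('HERPES SIMPLEX', 'P')
```

Keeping the order in a single list makes it easy to insert a new algorithm or to reorder the existing ones.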


There are many nuances and issues with electronic autocoding systems and associated processes that are beyond the scope of this paper. The intent here is to focus on additional techniques for increasing match rates that can be applied to most autocoding systems. Although SAS code was presented as syntax examples, these techniques can also be implemented in other application-building environments such as database-driven systems or object-oriented languages. Nor are these techniques limited to just MedDRA; virtually any dictionary can be used. It has been demonstrated that data normalization and autocoding algorithms beyond simple verbatim matches can significantly increase quality autocoding hits, and that the order in which data flow through an autocoding system plays an important role. It has also been stated that increasing electronic hits can reduce manual review time.


1. MedDRA MSSO Management Board Working Party. ICH Harmonized Tripartite Guideline for Good MedDRA Usage. MedDRA MSSO Management Board Working Party. 1999:(September)4-13.

2. Cappelleri CC, Lianng Y. Opportunities and challenges for industry and FDA. Biopharmaceutical Report. 1999;7(No.1):4-5.

3. Good Clinical Data Management Practices Committee. Good Clinical Data Management Practices. Hillsborough, NJ: Society for Clinical Data Management. 1999;2:49-52.

4. MedDRA Maintenance and Support Services Organization. MedDRA Introductory Guide. v5.0. Reston, VA: MedDRA Maintenance and Support Services Organization. 2002:(March)1-10.

5. Pitts K, Kathleen J, Benson L. Advances in volume encoding clinical data. Drug Inf J. 2001;35(4):1079-1092.

6. SAS Institute Inc. SAS(R) Language Reference: Dictionary, Version 8. Cary, NC: SAS Institute Inc.; 1999.


Manager, Clinical Technologies, ILEX Oncology Inc., San Antonio, Texas

Reprint address: Mike Tucker, ILEX Oncology Inc., 4545 Horizon Hill Blvd., San Antonio, TX 78229-2263


Copyright Drug Information Association Oct-Dec 2002
