DOWNLOAD DATA
VIDEO
Disease Associations
Brief Data Summary
All files are tab delimited text files
Source Data and Study Population

Hospital claims offer reliable, systematic, and complete data for disease detection [1,2,3]. Each record of our original dataset consists of the date of visit, a primary diagnosis, and up to 9 secondary diagnoses, all specified by ICD9 codes of up to 5 digits. The first three digits specify the main disease category, while the last two provide additional information about the disease. In total, the ICD-9-CM classification consists of 657 different categories at the 3-digit level and 16,459 categories at the 5-digit level. For a detailed list of currently used ICD9 codes, see www.icd9data.com.

We compiled raw Medicare claims [4,5] based on MedPAR records for 1990-1993, which contain information on 96% of elderly Americans whether they use health care or not [6]. For the 32 million elderly Americans aged 65 or older enrolled in Medicare and alive for the entire study period, there were a total of 32,341,347 inpatient claims pertaining to 13,039,018 individuals (the remaining individuals were not hospitalized at any point during this period).

Demographically, our data set consists of patients over 65 years old (see Fig 1 for the age distribution) and is composed mainly of white patients, with a higher percentage of females (Fig 2). Nevertheless, the data set is large enough to estimate race- and gender-specific comorbidity patterns.

Data Limitations

The medical claims were made available to us in the ICD-9-CM format, a controlled nomenclature constructed mainly for insurance claim purposes. Therefore, in some cases more than one code corresponds to a particular disease, whereas in other cases codes are not specific enough for research purposes. For example, at the 5-digit level there are 33 diagnoses associated with hypertension, which reduce to five at the 3-digit level. The vast majority of diseases, however, can be unambiguously assigned to an ICD9 code.
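The relationship between the 5-digit and 3-digit levels can be illustrated with a minimal Python sketch. It assumes codes are stored as strings of digits, with or without a decimal point, and that the category is simply the first three digits. The function names and the sample hypertension codes are illustrative choices, not taken from the dataset, and non-numeric codes (V and E codes) are deliberately rejected rather than handled.

```python
from collections import defaultdict


def icd9_category(code: str) -> str:
    """Return the 3-digit category of a numeric ICD-9 code.

    Accepts forms such as '40291' or '402.91' (assumed storage formats).
    """
    digits = code.strip().replace(".", "")
    if len(digits) < 3 or not digits.isdigit():
        raise ValueError(f"not a numeric ICD-9 code: {code!r}")
    return digits[:3]


def group_by_category(codes):
    """Map each 3-digit category to the detailed codes that fall under it."""
    groups = defaultdict(list)
    for code in codes:
        groups[icd9_category(code)].append(code)
    return dict(groups)


# A handful of hypertension-related codes (illustrative subset only).
hypertension_codes = ["4010", "4011", "4019", "40291", "40300", "40493", "40501"]
groups = group_by_category(hypertension_codes)

for category in sorted(groups):
    print(category, groups[category])
```

In this subset, seven detailed codes collapse into the five categories 401 through 405, which is the kind of reduction described for hypertension above.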
While hospital claims have been proposed as a reliable method for disease detection [7,8,9], our data do not capture a complete cross section of the population. The dataset consists of medical claims associated with hospitalizations of elderly citizens in the United States; it therefore contains limited information about diseases that are uncommon among the elderly in an industrialized country, such as many infectious diseases or pregnancy-related conditions. Nor does it contain information on patients who were not hospitalized. It does, however, contain a wealth of information about different types of heart disease and cancer, which are highly prevalent among elderly patients and are of major interest to the medical community.

We distinguish four main groups in the data set, defined by gender and race (Males = M, Females = F, White = W, Black = B).

Number of Patients per Demographic Group
REFERENCES:
1 Zhang J, Iwashyna TJ & Christakis NA. (1999) The Performance of Different Lookback Periods and Sources of Information for Charlson Comorbidity Adjustment in Medicare Claims. Medical Care 37: 1128-1139.
2 Cooper GS et al. (1999) The sensitivity of Medicare claims data for case ascertainment of six common cancers. Medical Care 37: 436-44.
3 Benesch C et al. (1997) Inaccuracy of the International Classification of Disease (ICD-9-CM) in identifying the diagnosis of ischemic cerebrovascular disease. Neurology 49: 660-664.
4 Lauderdale D, Furner SE, Miles TP & Goldberg J. (1993) Epidemiological uses of Medicare data. American Journal of Epidemiology 15: 319-327.
5 Mitchell JB et al. (1994) Using Medicare claims for outcomes research. Medical Care 32: JS38-JS51.
6 Hatten J. (1980) Medicare's Common Denominator: The Covered Population. Health Care Financing Review 2: 53-64.
7 Zhang J, Iwashyna TJ & Christakis NA. (1999) The Performance of Different Lookback Periods and Sources of Information for Charlson Comorbidity Adjustment in Medicare Claims. Medical Care 37: 1128-1139.
8 Cooper GS et al. (1999) The sensitivity of Medicare claims data for case ascertainment of six common cancers. Medical Care 37: 436-44.
9 Benesch C et al. (1997) Inaccuracy of the International Classification of Disease (ICD-9-CM) in identifying the diagnosis of ischemic cerebrovascular disease. Neurology 49: 660-664.
FIGURES
Fig 1. Age distribution for the study population.
Fig 2. Demographic breakdown of the study population.
To measure relatedness starting from disease co-occurrence, we need to quantify the strength of comorbidities by introducing a notion of “distance” between two diseases. A difficulty of this approach is that different statistical distance measures have biases that over- or underestimate the relationships between rare or prevalent diseases. These biases are important given that the number of times a particular disease is diagnosed (its prevalence) follows a heavy-tailed distribution (Fig 3), meaning that while most diseases are rarely diagnosed, a few diseases have been diagnosed in a large fraction of the population. Hence, quantifying comorbidity often requires us to compare diseases affecting a few dozen patients with diseases affecting millions. We will use two comorbidity measures to quantify the distance between two diseases: the relative risk (RR) and the φ-correlation (φ).

One crucial question is: how does the predictive power of comorbidity-based relationships compare with that of heredity and known genetic markers? Of the two measures discussed in Box 1, Relative Risk (RR) enjoys the most widespread use in the medical literature [1-10], making it the most suitable for such a comparison. We find that the relative risk of being diagnosed with one disease, given another disease affecting a patient in our data, varies in the range RR ~0.25-16 (Fig 4). Sibling studies have found that the relative risk of having a disease given that a sibling has the same disease typically ranges from RR ~3 for type 2 diabetes [1] to RR ~2-7 for early myocardial infarction [2], ~7-10 for bipolar disorder [3,4] and rheumatoid arthritis [5], and ~17-35 for Crohn’s Disease [6]. Most of these values fall in the range of relative risks associated with our observed comorbidities. Hence, statistically speaking, the magnitude of the disease risk predicted by comorbidity relationships is comparable to that of family history.
Furthermore, we can compare comorbidity statistics with typical relative risk values found in genetic susceptibility studies. For example, the relative risk of type 2 diabetes for carriers of the at-risk allele of TCF7L2 ranges between RR ~1.45 and 2.41 [7], whereas the rs2476601 SNP in the PTPN22 gene confers a genetic relative risk for rheumatoid arthritis of RR ~1.8 [8,9]. In comparison, a patient diagnosed with Ischemic Heart Disease has a relative risk of RR ~1.61 for type 2 diabetes, whereas a patient diagnosed with osteoporosis has a relative risk of RR ~3.64 for rheumatoid arthritis [10]. The statistical strength of the observed comorbidities is therefore comparable to that found in sibling and genetic susceptibility studies, a favorable comparison that provides further motivation to use comorbidity data to explore disease risk.
REFERENCES
1 Kobberling J & Tattersall R. The Genetics of Diabetes Mellitus (Academic Press, London, 1982)
2 Lusis AJ, Mar R & Pajukanta P. Genetics of atherosclerosis. Annu. Rev. Genomics Hum. Genet. 5, 189-218 (2004)
3 Craddock N, O’Donovan MC & Owen MJ. The genetics of schizophrenia and bipolar disorder: dissecting psychosis. J Med Genet 42, 193-204 (2005)
4 McGuffin et al. The heritability of bipolar affective disorder and the genetic relationships to unipolar depression. Arch Gen Psychiatry 60, 497-502 (2003)
5 Wordsworth P & Bell J. Polygenic susceptibility in rheumatoid arthritis. Ann. Rheum. Dis. 50, 343-346 (1991)
6 The Wellcome Trust Case Control Consortium. Genome-wide association study of 14000 cases of seven common diseases and 3000 shared controls. Nature 447, 661-677 (2007)
7 Grant SFA et al. Variant of transcription factor 7-like 2 (TCF7L2) gene confers risk of type 2 diabetes. Nature Genetics 38, 320-323.
8 Begovich AB et al. A missense single-nucleotide polymorphism in a gene encoding a protein tyrosine phosphatase (PTPN22) is associated with rheumatoid arthritis. Am. J. Hum. Genet. 75, 330-337 (2004)
9 Hinks A, Eyre S, Barton A, Thomson W & Worthington J. Investigation of genetic variation across PTPN22 in UK rheumatoid arthritis (RA) patients. Ann. Rheum. Dis. 66, 683-686 (2006)
10 Both RR were calculated using ICD9 codes at the 5-digit level for the entire study population.
11 Cohen J, Cohen P, West SG & Aiken LS. Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences, Third Edition. Lawrence Erlbaum Associates, Mahwah, New Jersey (2003)
VIDEO Quantifying Disease Comorbidity
RR Biases
FIGURES
Fig 3. Disease prevalence distribution
Fig 4. Comorbidity distribution calculated for all diseases using ICD9 codes at the 5 digit level.
QUANTIFYING COMORBIDITY STRENGTH:

We can quantify the strength of comorbidities by calculating the φ-correlation associated with a pair of diseases as:

$$\phi_{ij} = \frac{C_{ij} N - P_i P_j}{\sqrt{P_i P_j (N - P_i)(N - P_j)}}$$

where $C_{ij}$ is the number of patients affected by both diseases, $N$ is the total number of patients in the studied population, and $P_i$ is the prevalence of the $i$-th disease. The φ-correlation is Pearson’s correlation for dichotomous variables, i.e. variables that take only the values 0 or 1 [1].

We can determine the significance of $\phi \neq 0$ by performing a t-test, with $t$ given by:

$$t = \frac{\phi \sqrt{n - 2}}{\sqrt{1 - \phi^2}}$$

where $n$ is the number of observations used to calculate $\phi$.

Relative Risk

An alternative way of quantifying the correlation between two variables is to calculate their relative risk. The relative risk is the ratio between the observed co-occurrence and that expected under a null model. If diseases occurred completely independently of each other, the number of patients affected by both diseases would be given by:

$$C^{*}_{ij} = \frac{P_i P_j}{N}$$

Hence the relative risk of a pair of diseases is given by:

$$RR_{ij} = \frac{C_{ij}}{C^{*}_{ij}} = \frac{C_{ij} N}{P_i P_j}$$

which can also be written explicitly in terms of probabilities as:

$$RR_{ij} = \frac{C_{ij}/N}{(P_i/N)(P_j/N)}$$
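The two measures in this box can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the φ-correlation is written as Pearson's correlation for two 0/1 variables expressed in counts, the relative risk as observed co-occurrence divided by the count expected under independence, and the t statistic uses the standard test for a Pearson correlation. The function names, the choice of `n_obs`, and the toy counts are hypothetical.

```python
import math


def phi_correlation(c_ij: int, p_i: int, p_j: int, n: int) -> float:
    """Pearson correlation of two 0/1 disease indicators, from counts.

    c_ij: patients with both diseases; p_i, p_j: patients with each disease;
    n: total number of patients.
    """
    numerator = c_ij * n - p_i * p_j
    denominator = math.sqrt(p_i * p_j * (n - p_i) * (n - p_j))
    return numerator / denominator


def relative_risk(c_ij: int, p_i: int, p_j: int, n: int) -> float:
    """Observed co-occurrence divided by the count expected under independence."""
    expected = p_i * p_j / n
    return c_ij / expected


def t_statistic(phi: float, n_obs: int) -> float:
    """Standard t statistic for testing a Pearson correlation against zero.

    n_obs is the number of observations used to compute phi (an assumption
    about how the test is applied here).
    """
    return phi * math.sqrt(n_obs - 2) / math.sqrt(1.0 - phi ** 2)


# Toy counts (hypothetical): 1000 patients, 100 with disease i,
# 200 with disease j, 40 with both. Independence would predict 20.
phi = phi_correlation(40, 100, 200, 1000)
rr = relative_risk(40, 100, 200, 1000)
print(f"phi = {phi:.4f}, RR = {rr:.2f}, t = {t_statistic(phi, 1000):.2f}")
```

With these toy counts the pair co-occurs twice as often as independence would predict. When the observed count equals the expected count, the relative risk is 1 and the φ-correlation is 0, so both measures agree on what "no association" looks like even though they scale differently for rare and prevalent diseases.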
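The significance of the relative risk can be assessed using the method of Katz et al. to estimate confidence intervals [2]. According to their calculations, the lower (LB) and upper (UB) bounds of the 99% confidence interval for the RR between diseases i and j are given by:

$$LB = RR_{ij}\, e^{-2.56\,\sigma_{ij}}, \qquad UB = RR_{ij}\, e^{+2.56\,\sigma_{ij}}$$

where

$$\sigma_{ij} = \frac{1}{C_{ij}} + \frac{1}{P_i P_j} - \frac{1}{N} - \frac{1}{N^2}$$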
REFERENCES
1 Cohen J, Cohen P, West SG & Aiken LS. Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences. LEA, Mahwah, NJ (2003)
2 Katz D, Baptista J, Azen SP & Pike MC. Obtaining confidence interval for the risk ratio in cohort studies. Biometrics 34, 469-474 (1978)