Datasets Subpackage
Submodules
coxkan.datasets.deepsurv module
- class coxkan.datasets.deepsurv.GBSG
Bases:
_DeepSurvDataset
Rotterdam & German Breast Cancer Study Group (GBSG)
A combination of the Rotterdam tumor bank and the German Breast Cancer Study Group.
This is the processed data set used in the DeepSurv paper (Katzman et al. 2018), and details can be found at https://doi.org/10.1186/s12874-018-0482-1
See https://github.com/jaredleekatzman/DeepSurv/tree/master/experiments/data for original data.
Covariate names restored from https://www.kaggle.com/datasets/utkarshx27/breast-cancer-dataset-used-royston-and-altman
- Variables:
- hormon
hormonal therapy, 0= no, 1= yes
- size
tumor size (0: < 20 mm, 1: [20 mm to 50 mm], 2: > 50 mm)
- meno
menopausal status (0= premenopausal, 1= postmenopausal)
- age
age, years
- nodes
number of positive lymph nodes
- pgr
progesterone receptors (fmol/l)
- er
estrogen receptors (fmol/l)
- duration: (duration)
the right-censored event-times.
- event: (event)
event indicator {1: event, 0: censoring}.
- categorical_covariates = ['hormon', 'size', 'meno']
- covariates = ['hormon', 'size', 'meno', 'age', 'nodes', 'pgr', 'er']
- name = 'gbsg'
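The `covariates` and `categorical_covariates` attributes can be combined to recover the continuous columns. The following is a minimal sketch that uses plain Python lists copied from the attribute values above; the variable `continuous_covariates` is an illustrative name and is not assumed to exist in coxkan.

```python
# Attribute values copied from the GBSG class documentation above.
covariates = ['hormon', 'size', 'meno', 'age', 'nodes', 'pgr', 'er']
categorical_covariates = ['hormon', 'size', 'meno']

# Continuous covariates are those not flagged as categorical (order preserved).
categorical = set(categorical_covariates)
continuous_covariates = [c for c in covariates if c not in categorical]

print(continuous_covariates)
```

The same pattern should apply to the other dataset classes, since each one documents both attributes.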
- class coxkan.datasets.deepsurv.METABRIC
Bases:
_DeepSurvDataset
The Molecular Taxonomy of Breast Cancer International Consortium (METABRIC).
The study uses gene and protein expression profiles to determine new breast cancer subgroups in order to help physicians provide better treatment recommendations.
This is the processed data set used in the DeepSurv paper (Katzman et al. 2018), and details can be found at https://doi.org/10.1186/s12874-018-0482-1
According to the DeepSurv paper, the data was preprocessed according to the Immunohistochemical 4 plus Clinical (IHC4+C) test, such that the first four covariates are gene expression indicators and the last five are clinical features.
See https://github.com/jaredleekatzman/DeepSurv/tree/master/experiments/data for original data.
Covariate names restored from https://www.kaggle.com/datasets/raghadalharbi/breast-cancer-gene-expression-profiles-metabric (for the gene expressions, we calculated z-scores in order to compare them).
- Variables:
- EGFR:
Epidermal growth factor receptor.
- PGR:
Progesterone receptor.
- ERBB2:
Human epidermal growth factor receptor 2.
- MKI67:
Ki-67 protein expression.
- hormone:
Hormone treatment indicator.
- radio:
Radiotherapy indicator.
- chemo:
Chemotherapy indicator.
- ER:
Estrogen receptor positive indicator.
- age:
Age at diagnosis.
- duration: (duration)
the right-censored event-times.
- event: (event)
event indicator {1: event, 0: censoring}.
- categorical_covariates = ['hormone', 'radio', 'chemo', 'ER']
- covariates = ['EGFR', 'PGR', 'ERBB2', 'MKI67', 'hormone', 'radio', 'chemo', 'ER', 'age']
- name = 'metabric'
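The class description mentions z-scores for the gene expression covariates. As a rough illustration of what a z-score is, the sketch below standardizes a list of values with the standard library. The helper `z_scores`, the use of the population standard deviation, and the toy values are all assumptions made for this example; the source does not state how the authors computed theirs.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("z-scores are undefined for constant data")
    return [(v - mu) / sigma for v in values]

# Made-up expression values, for illustration only (not METABRIC data).
toy_expression = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
standardized = z_scores(toy_expression)
```

Standardized columns share a common scale, which makes values from differently scaled sources easier to compare.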
- class coxkan.datasets.deepsurv.SUPPORT
Bases:
_DeepSurvDataset
Study to Understand Prognoses Preferences Outcomes and Risks of Treatment (SUPPORT).
A study of survival for seriously ill hospitalized adults.
This is the processed data set used in the DeepSurv paper (Katzman et al. 2018), and details can be found at https://doi.org/10.1186/s12874-018-0482-1
See https://github.com/jaredleekatzman/DeepSurv/tree/master/experiments/data for original data.
Covariate names restored from https://hbiostat.org/data/repo/support2csv.zip
- Variables:
- age:
age in years.
- sex:
patient sex. (1: female, 0: male)
- race:
patient race (unfortunately, the original data did not specify which values represent which races).
- comorbidity:
number of comorbidities.
- diabetes:
presence of diabetes.
- dementia:
presence of dementia.
- cancer:
presence of cancer. (2: yes, 1: no, 0: metastatic)
- meanbp:
mean arterial blood pressure.
- hr:
heart rate.
- rr:
respiration rate.
- temp:
temperature.
- sodium:
serum sodium.
- wbc:
white blood cell count.
- creatinine:
serum creatinine.
- duration: (duration)
the right-censored event-times.
- event: (event)
death indicator {1: death, 0: censoring}.
- categorical_covariates = ['sex', 'race', 'diabetes', 'dementia', 'cancer']
- covariates = ['age', 'sex', 'race', 'comorbidity', 'diabetes', 'dementia', 'cancer', 'meanbp', 'hr', 'rr', 'temp', 'sodium', 'wbc', 'creatinine']
- name = 'support'
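The integer codes of the `cancer` covariate are not ordered intuitively (0 is metastatic, 1 is no, 2 is yes). The sketch below decodes them into labels. The mapping is copied from the variable description above, while the helper `decode_cancer` and the toy codes are hypothetical and only serve as an illustration.

```python
# Code-to-label mapping copied from the `cancer` variable description above.
CANCER_LABELS = {0: "metastatic", 1: "no", 2: "yes"}

def decode_cancer(codes):
    """Translate integer cancer codes into their documented labels."""
    # int() also accepts codes stored as floats such as 2.0.
    return [CANCER_LABELS[int(code)] for code in codes]

# Made-up codes for illustration only (not SUPPORT data).
toy_codes = [2, 1, 0, 1.0]
toy_labels = decode_cancer(toy_codes)
```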
coxkan.datasets.rdatasets module
- class coxkan.datasets.rdatasets.FLCHAIN
Bases:
_BaseDataset
Assay of serum free light chain (FLCHAIN). Obtained from Rdatasets (https://github.com/vincentarelbundock/Rdatasets).
A study of the relationship between serum free light chain (FLC) and mortality. The original sample covers approximately 2/3 of the residents of Olmsted County aged 50 or greater.
For details see http://vincentarelbundock.github.io/Rdatasets/doc/survival/flchain.html
- Variables:
- age:
age in years.
- sex:
F=female, M=male.
- sample.yr:
the calendar year in which a blood sample was obtained.
- kappa:
serum free light chain, kappa portion.
- lambda:
serum free light chain, lambda portion.
- flc.grp:
the FLC group for the subject, as used in the original analysis.
- creatinine:
serum creatinine.
- mgus:
1 if the subject had been diagnosed with monoclonal gammopathy (MGUS).
- futime: (duration)
days from enrollment until death. Note that there are 3 subjects whose sample was obtained on their death date.
- death: (event)
0=alive at last contact date, 1=dead.
- chapter:
for those who died, a grouping of their primary cause of death by chapter headings of the International Classification of Diseases (ICD-9).
- categorical_covariates = ['sex', 'sample.yr', 'flc.grp', 'mgus']
- covariates = ['age', 'sex', 'sample.yr', 'kappa', 'lambda', 'flc.grp', 'creatinine', 'mgus']
- duration_col = 'futime'
- event_col = 'death'
- load(split=False)
Get dataset.
If ‘processed’ is False, return the raw data set. See the code for processing.
- name = 'flchain'
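The `duration_col` and `event_col` attributes name the columns that hold the survival outcome. The sketch below shows how they might be used to pull out durations and event indicators. It works on made-up rows represented as dictionaries; it does not call coxkan, and the return type of `load` is not assumed here.

```python
# Column names copied from the FLCHAIN class attributes above.
duration_col = 'futime'
event_col = 'death'

# Made-up rows for illustration only (not FLCHAIN data).
toy_rows = [
    {'age': 67, 'futime': 85, 'death': 1},
    {'age': 52, 'futime': 4998, 'death': 0},
    {'age': 71, 'futime': 0, 'death': 1},  # sample obtained on the death date
]

durations = [row[duration_col] for row in toy_rows]
events = [row[event_col] for row in toy_rows]

# Fraction of subjects whose event time is right-censored (death == 0).
censored_fraction = events.count(0) / len(events)
```

Referring to the columns through the two attributes, rather than hard-coding `'futime'` and `'death'`, lets the same code work for datasets with different outcome column names, such as NWTCO.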
- class coxkan.datasets.rdatasets.NWTCO
Bases:
_BaseDataset
Data from the 3rd and 4th clinical trials of the National Wilms Tumor Study Group. Obtained from Rdatasets (https://github.com/vincentarelbundock/Rdatasets).
Measurement error example. Tumor histology predicts survival, but prediction is stronger with central lab histology than with the local institution determination.
For details see http://vincentarelbundock.github.io/Rdatasets/doc/survival/nwtco.html
- Variables:
- instit:
- histology reading from local institution:
1: favorable
2: unfavorable
- histol:
- histology reading from central lab:
1: favorable
2: unfavorable
- stage:
- disease stage:
1: localized to the kidney and completely resected
2: spread beyond the kidney but completely resected
3: residual tumour in the abdomen or tumour in the lymph nodes
4: metastatic to the lung or liver.
- study:
clinical trial number (3 or 4)
- age:
age in months
- in.subcohort:
included in the subcohort for the example in the paper
- rel: (event)
indicator for relapse
- edrel: (duration)
time to relapse
- References
NE Breslow and N Chatterjee (1999), Design and analysis of two-phase studies with binary outcome applied to Wilms tumor prognosis. Applied Statistics 48, 457–68.
- categorical_covariates = ['instit', 'histol', 'stage', 'study', 'in.subcohort']
- covariates = ['instit', 'histol', 'stage', 'study', 'age', 'in.subcohort']
- duration_col = 'edrel'
- event_col = 'rel'
- load(split=False)
Get dataset.
If ‘processed’ is False, return the raw data set. See the code for processing.
- name = 'nwtco'
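Because `instit` and `histol` record the same histology as read by two different sources, a cross-tabulation is a simple way to see how often they disagree. The sketch below does this on made-up pairs using only the standard library; the data and the variable names are hypothetical and say nothing about the actual agreement rate in NWTCO.

```python
from collections import Counter

# Made-up (instit, histol) pairs for illustration only (not NWTCO data).
# Codes follow the variable descriptions above: 1 = favorable, 2 = unfavorable.
toy_readings = [(1, 1), (1, 1), (2, 2), (1, 2), (2, 2), (1, 1)]

# Count each (local, central) combination.
crosstab = Counter(toy_readings)

# Share of cases where the local reading matches the central lab reading.
matching = sum(n for (local, central), n in crosstab.items() if local == central)
agreement = matching / len(toy_readings)
```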