Question:
I asked this question and it helped me, but now my task is more complex.My dataframe has ~100 columns and values with 14 scales.
{'Diseasetype': {0: 'Oncology',
1: 'Oncology',
2: 'Oncology',
3: 'Nononcology',
4: 'Nononcology',
5: 'Nononcology'},
'Procedures1': {0: 100, 1: 300, 2: 500, 3: 200, 4: 400, 5: 1000},
'Procedures2': {0: 1, 1: 3, 2: 5, 3: 2, 4: 4, 5: 10},
'Procedures100': {0: 1000, 1: 3000, 2: 5000, 3: 2000, 4: 4000, 5: 10000}}
I want to convert each value in each column of the dataframe into a bucket value.My current solution is:
def encoding(col, labels):
return np.select([col<200, col.between(200,500), col.between(500,1000), col>1000], labels, 0)
onc_labels = [1,2,3,4]
nonc_labels = [11,22,33,44]
msk = df['Therapy_area'] == 'Oncology'
df[cols] = pd.concat((df.loc[msk, cols].apply(encoding, args=(onc_labels,)), df.loc[msk, cols].apply(encoding, args=(nonc_labels,)))).reset_index(drop=True)
It works well, if all columns of the dataframe has the same scale, but they do not. Remember, I have 14 different scales.I would like to update the code above (or get another solution), which would allow me to bucket data. I cannot use the same range of values for bucketing everything.
My logic is the following:
If
Disease
== Oncology
and Procedures1
on this scale, convert values to these buckets (1, 2, 3)If
Disease
== Oncology
and Procedures2
on this scale, convert values to these buckets (1, 2, 3)If
Disease
!= Oncology
and Procedures77
on this scale, convert values to these buckets (4, 5, 6)Example of a scale and buckets:
Procedures1 for Oncology: < 200 = 1, 200-400 = 2, >400 = 3
Procedures2 for Oncology: < 2 = 1, 2-4 = 2, >4 = 3
Procedures3 for Oncology: < 2000 = 1, 2000-4000 = 2, >4000 = 3
Procedures1 for nonOncology: < 200 = 4, 200-400 = 5, >400 = 6
Procedures2 for nonOncology: < 2 = 4, 2-4 = 5, >4 = 6
Procedures3 for nonOncology: < 2000 = 4, 2000-4000 = 5, >4000 = 6
Expected output (happy to provide more info!)
Diseasetype Procedures1 Procedures2 Procedures100
Oncology 1 1 1
Oncology 2 2 2
Oncology 3 3 3
Nononcology 4 4 4
Nononcology 5 5 5
Nononcology 6 6 6
Link with rules:Answer:
I used an helper file with all scales (source at the end of answer):
Use
melt
to flatten your dataframe then filter out your rows with query
and finally use pivot
to reshape your dataframe. You can execute each line independently to show the transformations:scales = pd.read_csv('scales.csv').fillna({'Start': -np.inf, 'End': np.inf})
out = (
df.melt('Diseasetype', var_name='Procedure', ignore_index=False).reset_index()
.merge(scales, on=['Diseasetype', 'Procedure'], how='left')
.query("value.between(Start, End)")
.pivot_table('Label', ['index', 'Diseasetype'], 'Procedure').astype(int)
.droplevel(0).rename_axis(columns=None).reset_index()
)
Output:>>> df
Diseasetype Procedures1 Procedures100 Procedures2
0 Oncology 1 1 1
1 Oncology 2 2 2
2 Oncology 3 3 3
3 Nononcology 4 4 4
4 Nononcology 5 5 5
5 Nononcology 6 6 6
Content of scales.csv
:Diseasetype,Procedure,Start,End,Label
Oncology,Procedures1,,200,1
Oncology,Procedures1,200,400,2
Oncology,Procedures1,400,,3
Oncology,Procedures2,,2,1
Oncology,Procedures2,2,4,2
Oncology,Procedures2,4,,3
Oncology,Procedures100,,2000,1
Oncology,Procedures100,2000,4000,2
Oncology,Procedures100,4000,,3
Nononcology,Procedures1,,200,4
Nononcology,Procedures1,200,400,5
Nononcology,Procedures1,400,,6
Nononcology,Procedures2,,2,4
Nononcology,Procedures2,2,4,5
Nononcology,Procedures2,4,,6
Nononcology,Procedures100,,2000,4
Nononcology,Procedures100,2000,4000,5
Nononcology,Procedures100,4000,,6
If you have better answer, please add a comment about this, thank you!
Leave a Review