A super demo for nbdev.
Always follow each function with assert tests - nbdev will run them and alert you if something has gone wrong. (Mark with #test, not #export!)
Alas, pandas 1.3 has dropped pd.util.testing, which had 20ish functions for building test dataframes, so we have to write our own. We'll call it before the relevant tests.
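For reference, here is a minimal sketch of what such a helper might look like. The name make_mixed_dataframe and the column values are assumptions, chosen so that the asserts that follow would hold; the real makeMixedDataFrame in this project may differ.

```python
import pandas as pd

def make_mixed_dataframe():
    """Return a small frame mixing float, string, and datetime columns (illustrative only)."""
    return pd.DataFrame({
        "A": [0.0, 1.0, 2.0, 3.0, 4.0],
        "B": [0.0, 1.0, 0.0, 1.0, 0.0],
        "C": ["foo1", "foo2", "foo3", "foo4", "foo5"],
        # Five consecutive business days starting Thursday 2009-01-01
        "D": pd.bdate_range("2009-01-01", periods=5),
    })

df = make_mixed_dataframe()
```

Because bdate_range skips the weekend, row 2 of 'D' is Monday 2009-01-05 rather than 2009-01-03.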
df = makeMixedDataFrame()
assert df.loc[2,'A'] == 2.0
assert df.loc[2,'C'] == 'foo3'
assert df.loc[2,'D'] == pd.Timestamp('2009-01-05 00:00:00')
df = getCrashes()
df.sample(5)
TODO: Try Maya Gilad's approach -- move the bottom x% into 'Other':
field = df[FILENAME]
field.mask(field.map(
field.value_counts(normalize=True)) < 0.01, 'Other')
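To see what that expression does, here is a small self-contained sketch on made-up data. The function name lump_rare, its parameters, and the sample Series are hypothetical, and the threshold is raised to 0.2 so the effect is visible on ten rows.

```python
import pandas as pd

def lump_rare(field, threshold=0.01, other="Other"):
    """Replace values whose relative frequency is below `threshold` with `other`."""
    # Map every row to the relative frequency of its own value
    freq = field.map(field.value_counts(normalize=True))
    # mask() replaces entries where the condition is True
    return field.mask(freq < threshold, other)

s = pd.Series(["a"] * 6 + ["b"] * 3 + ["c"])
lumped = lump_rare(s, threshold=0.2)
```

Here 'a' and 'b' make up 60% and 30% of the rows and are kept, while 'c' at 10% falls under the threshold and becomes 'Other'. Note that mask returns a new Series, so the result has to be assigned back to the frame.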
df = makeMixedDataFrame()
assert all(df.apply(is_numeric) == [True, True, False, True])
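The test above expects the datetime column 'D' to count as numeric. One way to get that behaviour is sketched below; is_numeric_sketch is a hypothetical stand-in, and treating datetimes as numeric is an assumption read off the expected result, not the project's actual is_numeric.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype

def is_numeric_sketch(col):
    """True for number-like columns, counting datetimes as numeric (assumption)."""
    return is_numeric_dtype(col) or is_datetime64_any_dtype(col)

df = pd.DataFrame({
    "num": [1.5, 2.5],
    "txt": ["a", "b"],
    "when": pd.to_datetime(["2009-01-01", "2009-01-02"]),
})
# apply() calls the function once per column and returns one flag per column
flags = df.apply(is_numeric_sketch).tolist()
```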
Note that pd.NA and np.nan count as values, so a column containing only NA is dropped as a singleton, while a column with NA and one other value is kept.
df = makeMixedDataFrame()
df['E'] = [1, 1, 1, 1, 1]
df['F'] = pd.Series([1, 1, 1, None, None]).astype('UInt8')
df['G'] = pd.Series([1, 1, 1, None, None]).astype('float')
df['H'] = pd.Series([None]*5).astype('UInt8')
drop_singletons(df)
assert all(df.columns == ['A', 'B', 'C', 'D', 'F', 'G'])
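The behaviour tested above can be sketched with nunique(dropna=False), which counts NA as a value. drop_singletons_sketch is a hypothetical stand-in; the in-place drop is an assumption based on the test calling drop_singletons(df) without using a return value.

```python
import numpy as np
import pandas as pd

def drop_singletons_sketch(df):
    """Drop, in place, every column that holds at most one distinct value (NA included)."""
    singles = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
    df.drop(columns=singles, inplace=True)
    return df

df = pd.DataFrame({
    "const": [1, 1, 1],            # one value: dropped
    "with_na": [1.0, 1.0, np.nan], # 1.0 and NA are two values: kept
    "all_na": [np.nan] * 3,        # NA only: dropped
    "varied": [1, 2, 3],           # kept
})
drop_singletons_sketch(df)
```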
This should drop 'B' as a singleton, bin the two continuous cols, and convert 'C' into 'foo3', 'foo4', and 'Other'.
df = makeMixedDataFrame()
df = discretize(df, nbins=2)
assert all(df.columns == ['A', 'C', 'D'])
This is more of a "visual" test - no assert statement to fail.
df.A.unique()
u = df.C.unique()
assert 'Other' in u and len(u) == 3
df.C.unique()
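The three steps described for discretize (drop singletons, bin the continuous columns, lump the less common categories into 'Other') can be sketched roughly as follows. discretize_sketch, its top parameter, and the use of pd.cut are assumptions for illustration; this is not the project's implementation and it does not reproduce every result of the tests above.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def discretize_sketch(df, nbins=2, top=2):
    """Return a copy with singletons dropped, numerics binned, rare categories lumped."""
    out = pd.DataFrame(index=df.index)
    for name, col in df.items():
        if col.nunique(dropna=False) <= 1:
            continue  # singleton column: nothing to learn from it
        if is_numeric_dtype(col):
            # Equal-width bins over the column's range
            out[name] = pd.cut(col, bins=nbins)
        else:
            # Keep the `top` most frequent values, replace the rest
            keep = col.value_counts().index[:top]
            out[name] = col.where(col.isin(keep), "Other")
    return out

raw = pd.DataFrame({
    "x": [0.0, 1.0, 2.0, 3.0],
    "k": [7, 7, 7, 7],
    "c": ["a", "a", "b", "c"],
})
binned = discretize_sketch(raw, nbins=2, top=1)
```

With top=1 only the most frequent category 'a' survives in 'c'; 'b' and 'c' become 'Other', and the constant column 'k' disappears.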