A super demo for nbdev.

Test Data

Always follow each function with assert tests - nbdev will run them and alert you if something has gone wrong. (Mark with #test, not #export!)

Alas, pandas 1.3 has dropped pd.util.testing, which had 20ish functions for building test dataframes, so we have to write one. We'll call this before relevant tests.

makeMixedDataFrame[source]

makeMixedDataFrame()

Return a constant mixed-type dataframe [float, float, str, datetime]
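The function body is not shown on this page, so here is a minimal sketch of what it could look like. It is modeled on the old pandas helper of the same name and on the asserts below; the exact column values are an assumption, not the author's confirmed code.

```python
import pandas as pd

def makeMixedDataFrame():
    """Return a constant mixed-type dataframe [float, float, str, datetime]."""
    # Assumed values: chosen to satisfy the asserts in this notebook.
    return pd.DataFrame({
        "A": [0.0, 1.0, 2.0, 3.0, 4.0],
        "B": [0.0, 1.0, 0.0, 1.0, 0.0],
        "C": ["foo1", "foo2", "foo3", "foo4", "foo5"],
        # Five consecutive business days starting Thursday 2009-01-01.
        "D": pd.bdate_range("2009-01-01", periods=5),
    })
```

Because the frame is constant, tests can assert on exact cell values.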

getCrashes[source]

getCrashes(dataset='car_crashes')

df = makeMixedDataFrame()
assert df.loc[2,'A'] == 2.0
assert df.loc[2,'C'] == 'foo3'
assert df.loc[2,'D'] == pd.Timestamp('2009-01-05 00:00:00')
df = getCrashes()
df.sample(5)
    total  speeding  alcohol  not_distracted  no_previous  ins_premium  ins_losses  abbrev
15   15.7     2.669    3.925          15.229       13.659       649.06      114.47      IA
47   10.6     4.452    3.498           8.692        9.116       890.03      111.62      WA
27   14.9     1.937    5.215          13.857       13.410       732.28      114.82      NE
36   19.9     6.368    5.771          18.308       18.706       881.51      178.86      OK
20   12.5     4.250    4.000           8.875       12.375      1048.78      192.70      MD

Generalized Discretization

Discretize a whole dataframe into at most $N$ categories:

  • Bin numerics into $≤N$ bins.
  • Use only the Top $N$ categories, and "Other".

For QuickLooks, BN learning, and other household uses.

TODO: Try Maya Gilad's approach -- move the bottom x% into 'Other':

field = df[FILENAME]
field.mask(field.map(
    field.value_counts(normalize=True)) < 0.01, 'Other')
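To make the TODO concrete, here is a small runnable sketch of the same mask/map/value_counts idea. The helper name lump_rare, its parameters, and the sample data are hypothetical, introduced only for illustration.

```python
import pandas as pd

def lump_rare(field, min_frac=0.01, other="Other"):
    """Replace values whose relative frequency is below min_frac (hypothetical helper)."""
    # Relative frequency of each distinct value, indexed by the value itself.
    freq = field.value_counts(normalize=True)
    # Look up each row's frequency, then mask the rows that fall below the threshold.
    return field.mask(field.map(freq) < min_frac, other)

# Made-up data: 'teal' appears in 10% of rows, below the 20% threshold used here.
colors = pd.Series(["red"] * 6 + ["blue"] * 3 + ["teal"])
lumped = lump_rare(colors, min_frac=0.2)
```

Unlike a fixed Top N, this keeps however many categories clear the frequency threshold, so the category count is not bounded by N.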

is_numeric

is_numeric[source]

is_numeric(col:str)

Returns True iff the column is already numeric or can be coerced to numeric.

Usage:

  • df.apply(is_numeric)
  • is_numeric(df['colname'])

Applied across a dataframe, returns a Boolean series with one entry per column.

From: https://stackoverflow.com/questions/54426845/how-to-check-if-a-pandas-dataframe-contains-only-numeric-column-wise
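A minimal sketch of the coercion check, in the spirit of the linked answer. The name is_numeric_sketch and the sample frame are hypothetical, and the author's actual body may differ.

```python
import pandas as pd

def is_numeric_sketch(col):
    """True if every value in col survives coercion to a number (illustrative only)."""
    # Values that cannot be parsed become NaN under errors="coerce".
    coerced = pd.to_numeric(col, errors="coerce")
    # Caveat: a column that already contains missing values also reports False.
    return bool(coerced.notna().all())

frame = pd.DataFrame({"x": [1.5, 2.5], "y": ["3", "4"], "z": ["foo", "7"]})
# One Boolean per column.
flags = frame.apply(is_numeric_sketch)
```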

df = makeMixedDataFrame()
assert all(df.apply(is_numeric) == [True, True, False, True])

drop_singletons

drop_singletons[source]

drop_singletons(df, verbose=1)

Drop columns with fewer than 2 unique values. Operates in place.

Note that pd.NA and np.nan count as values, so columns containing only NA are dropped, but columns with NA and one other value remain.
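One way to get this behavior is to count unique values with missing values included. The sketch below assumes that approach. The name drop_singletons_sketch is hypothetical and the real function may be written differently.

```python
import numpy as np
import pandas as pd

def drop_singletons_sketch(df, verbose=1):
    """Drop, in place, columns with fewer than 2 unique values (illustrative only)."""
    # dropna=False makes a missing value count as a value of its own.
    singles = [c for c in df.columns if df[c].nunique(dropna=False) < 2]
    df.drop(columns=singles, inplace=True)
    if verbose and singles:
        print(f"  DROPPED {singles} because < 2 vals each.")

demo = pd.DataFrame({
    "const": [1, 1, 1],
    "partly_na": [1.0, np.nan, np.nan],
    "all_na": [np.nan, np.nan, np.nan],
    "varied": [1, 2, 3],
})
drop_singletons_sketch(demo, verbose=0)
```

Here 'const' and 'all_na' are removed, while 'partly_na' stays because NaN and 1.0 are two distinct values.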

df = makeMixedDataFrame()
df['E'] = [1, 1, 1, 1, 1]
df['F'] = pd.Series([1, 1, 1, None, None]).astype('UInt8')
df['G'] = pd.Series([1, 1, 1, None, None]).astype('float')
df['H'] = pd.Series([None]*5).astype('UInt8')

drop_singletons(df)
  DROPPED ['E', 'H'] because < 2 vals each.
assert all(df.columns == ['A', 'B', 'C', 'D', 'F', 'G'])

discretize

Woo-hoo! It's all been leading to this.

Seriously, these headers are redundant -- nbdev will generate nice-looking docs using the function names and docstrings.

But that requires GitHub or setting up Jekyll, and I broke my env.

discretize[source]

discretize(df, nbins=10, cut=qcut, verbose=2, drop_useless=True)

Discretize columns in {df} to have at most {nbins} categories.

  • Categorical columns: take the Top n-1 plus "Other"
  • Continuous columns: cut into {nbins} using {cut}.

Returns a new discretized dataframe with the same column names. Promotes discrete columns to categories.

Parameters

  • df: Dataframe to discretize.
  • nbins: Max number of bins to use. May return fewer.
  • cut: Cutting method. Default pd.qcut. Consider pd.cut, or write your own.
  • verbose: 0: silent, 1: colnames, 2: (Default) top N for each column.
  • drop_useless: Removes columns that have < 2 unique values.

Replaces numerical NA values with 'NA'.
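The overall flow can be sketched as follows. This is a simplified illustration built from the docstring, not the author's implementation: the name discretize_sketch is hypothetical, and it skips datetime columns, the 'NA' replacement, and verbose printing.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype

def discretize_sketch(df, nbins=10, cut=pd.qcut, drop_useless=True):
    """Return a new frame with at most nbins categories per column (illustrative only)."""
    out = {}
    for name, col in df.items():
        numeric = is_numeric_dtype(col) and not is_bool_dtype(col)
        if numeric and col.nunique() > 1:
            # duplicates="drop" merges repeated bin edges, so fewer bins may come back.
            out[name] = cut(col, nbins, duplicates="drop")
        else:
            # Keep the nbins-1 most frequent values and lump the rest into "Other".
            top = col.value_counts().index[: max(nbins - 1, 1)]
            out[name] = col.where(col.isin(top), "Other").astype("category")
    result = pd.DataFrame(out, index=df.index)
    if drop_useless:
        # A column that collapsed to a single bin or category carries no information.
        keep = [c for c in result.columns if result[c].nunique(dropna=False) >= 2]
        result = result[keep]
    return result

mixed = pd.DataFrame({
    "A": [0.0, 1.0, 2.0, 3.0, 4.0],
    "B": [0.0, 1.0, 0.0, 1.0, 0.0],
    "C": ["foo1", "foo2", "foo3", "foo4", "foo5"],
})
binned = discretize_sketch(mixed, nbins=2)
```

With nbins=2, column 'B' collapses into a single quantile bin and is dropped, which is why the singleton check runs after binning rather than before.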

This should drop 'B' as a singleton, bin the two continuous columns, and convert 'C' into 'foo1', 'foo5', and 'Other'.

df = makeMixedDataFrame()
df = discretize(df, nbins=2)
A:
	(-0.001, 2.0]    3
	(2.0, 4.0]       2
B:
	(-0.001, 1.0]    5
C:
	foo1     1
	foo5     1
	Other    3
D:
	(2008-12-31 23:59:59.999999999, 2009-01-05]    3
	(2009-01-05, 2009-01-07]                       2
  DROPPED ['B'] because < 2 vals each.
assert all(df.columns == ['A', 'C', 'D'])

This is more of a "visual" test - no assert statement to fail.

df.A.unique()
[(-0.001, 2.0], (2.0, 4.0]]
Categories (2, interval[float64]): [(-0.001, 2.0] < (2.0, 4.0]]
u = df.C.unique()
assert 'Other' in u and len(u) == 3
df.C.unique()
['foo1', 'Other', 'foo5']
Categories (3, object): ['foo1', 'Other', 'foo5']

Plotting helpers...