Test Data

Always follow each function with assert tests - nbdev will run them and alert you if something has gone wrong. (Mark with #test, not #export!)

Alas, pandas has deprecated pd.util.testing (the warning dates from pandas 1.0), which had 20ish functions to define test dataframes, so we have to write our own. We'll call it before the relevant tests.
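A minimal sketch of such a helper, assuming it simply reproduces the frame that the old pandas function of the same name returned (two float columns, one string column, and one column of business days starting 2009-01-01). The body is an assumption; this notebook's own definition may differ.

```python
import pandas as pd


def makeMixedDataFrame():
    """Return a small 5-row frame with float, float, string and datetime columns."""
    return pd.DataFrame({
        "A": [0.0, 1.0, 2.0, 3.0, 4.0],
        "B": [0.0, 1.0, 0.0, 1.0, 0.0],
        "C": ["foo1", "foo2", "foo3", "foo4", "foo5"],
        # Business days only, so the weekend of 3-4 January 2009 is skipped.
        "D": pd.bdate_range("1/1/2009", periods=5),
    })
```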

df = makeMixedDataFrame()
assert df.loc[2,'A'] == 2.0
assert df.loc[2,'C'] == 'foo3'
assert df.loc[2,'D'] == pd.Timestamp('2009-01-05 00:00:00')

df = getCrashes()
df.sample(5)

Generalized Discretization

Discretize a whole dataframe into at most $N$ categories:

- Bin numerics into $\le N$ bins.
- Keep only the top $N$ categories, and lump the rest into "Other".

For QuickLooks, BN learning, and other household uses.

TODO: Try Maya Gilad's approach -- move the bottom x% into 'Other':

field = df[FILENAME]
field.mask(field.map(
    field.value_counts(normalize=True)) < 0.01, 'Other')
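To see what that expression does, here is a small self-contained run on made-up data (the Series below is illustrative, not taken from the crashes set). Each row is mapped to the relative frequency of its value, and rows whose value covers less than 1% of the data are replaced.

```python
import pandas as pd

# 200 rows: 'a' is 75%, 'b' is 24.5%, 'c' is 0.5% of the data.
field = pd.Series(["a"] * 150 + ["b"] * 49 + ["c"])

# Relative frequency of each distinct value, looked up per row.
freq = field.map(field.value_counts(normalize=True))

# Replace every row whose value is rarer than 1% with 'Other'.
lumped = field.mask(freq < 0.01, "Other")
```

Unlike a fixed top-N cut, this keeps however many categories clear the threshold, so the number of surviving categories depends on the data.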

is_numeric

df = makeMixedDataFrame()
assert all(df.apply(is_numeric) == [True, True, False, True])
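A minimal sketch of a predicate that satisfies the test above. Counting datetime columns as numeric is inferred from the expected True for column 'D'; the real is_numeric may be written differently.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype


def is_numeric(col):
    """True for columns that can be binned: numeric or datetime dtypes."""
    return bool(is_numeric_dtype(col) or is_datetime64_any_dtype(col))
```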

drop_singletons

Note that pd.NA and np.nan count as values: a column containing only NA has a single value and is dropped, while a column with NA and one other value has two and remains.

df = makeMixedDataFrame()
df['E'] = [1, 1, 1, 1, 1]
df['F'] = pd.Series([1, 1, 1, None, None]).astype('UInt8')
df['G'] = pd.Series([1, 1, 1, None, None]).astype('float')
df['H'] = pd.Series([None]*5).astype('UInt8')

drop_singletons(df)

  DROPPED ['E', 'H'] because < 2 vals each.

assert all(df.columns == ['A', 'B', 'C', 'D', 'F', 'G'])
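A sketch of one way to get this behaviour, assuming the function works in place and counts NA as a value via nunique(dropna=False). The message format copies the output shown above; everything else is an assumption rather than the actual implementation.

```python
import numpy as np
import pandas as pd


def drop_singletons(df):
    """Drop, in place, every column with fewer than 2 distinct values.

    NA is counted as a value, so an all-NA column has 1 value (dropped)
    and a column holding NA plus one other value has 2 (kept).
    """
    singles = [c for c in df.columns if df[c].nunique(dropna=False) < 2]
    if singles:
        df.drop(columns=singles, inplace=True)
        print(f"  DROPPED {singles} because < 2 vals each.")
    return df
```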

discretize

Woo-hoo! It's all been leading to this.

Seriously, these headers are redundant -- nbdev will generate nice-looking docs using the function names and docstrings.

But that requires GitHub or setting up Jekyll, and I broke my env.

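This should bin the two continuous cols ('A' and 'D'), drop 'B' as a singleton (its two values collapse into a single bin), and reduce 'C' to two of its values plus 'Other'. All five 'C' values occur exactly once, so which two survive is arbitrary; in the run below they are 'foo1' and 'foo5'.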

df = makeMixedDataFrame()
df = discretize(df, nbins=2)

A:
	(-0.001, 2.0]    3
	(2.0, 4.0]       2
B:
	(-0.001, 1.0]    5
C:
	foo1     1
	foo5     1
	Other    3
D:
	(2008-12-31 23:59:59.999999999, 2009-01-05]    3
	(2009-01-05, 2009-01-07]                       2
  DROPPED ['B'] because < 2 vals each.

assert all(df.columns == ['A', 'C', 'D'])
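The output above is consistent with quantile binning that merges duplicate bin edges, followed by the singleton drop. A rough sketch under that assumption: pd.qcut with duplicates="drop" for numeric and datetime columns, top-nbins by value_counts for everything else. It returns a new frame and skips the per-column printout, so treat it as an illustration, not the function's actual source.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype


def discretize(df, nbins=10):
    """Return a copy of df with every column reduced to a few categories."""
    out = pd.DataFrame(index=df.index)
    for name, col in df.items():
        if is_numeric_dtype(col) or is_datetime64_any_dtype(col):
            # Quantile bins. duplicates="drop" merges repeated edges,
            # so a column can end up with fewer than nbins bins.
            out[name] = pd.qcut(col, nbins, duplicates="drop")
        else:
            # Keep the nbins most frequent values, lump the rest together.
            top = col.value_counts().index[:nbins]
            out[name] = col.where(col.isin(top), "Other").astype("category")
    # Columns that collapsed to a single category carry no information.
    keep = [c for c in out.columns if out[c].nunique(dropna=False) >= 2]
    return out[keep]
```

With nbins=2, a column such as 'B' (values 0 and 1, median 0) gets quantile edges 0, 0, 1; the duplicate edge is merged, one bin remains, and the column is dropped.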

This is more of a "visual" test - no assert statement to fail.

df.A.unique()

[(-0.001, 2.0], (2.0, 4.0]]
Categories (2, interval[float64]): [(-0.001, 2.0] < (2.0, 4.0]]

u = df.C.unique()
assert 'Other' in u and len(u) == 3

df.C.unique()

['foo1', 'Other', 'foo5']
Categories (3, object): ['foo1', 'Other', 'foo5']

Output of df.sample(5) on the crashes data:

    total  speeding  alcohol  not_distracted  no_previous  ins_premium  ins_losses abbrev
15   15.7     2.669    3.925          15.229       13.659       649.06      114.47     IA
47   10.6     4.452    3.498           8.692        9.116       890.03      111.62     WA
27   14.9     1.937    5.215          13.857       13.410       732.28      114.82     NE
36   19.9     6.368    5.771          18.308       18.706       881.51      178.86     OK
20   12.5     4.250    4.000           8.875       12.375      1048.78      192.70     MD

mydemo

Test Data

`makeMixedDataFrame`[source]

`getCrashes`[source]

Generalized Discretization

is_numeric

`is_numeric`[source]

drop_singletons

`drop_singletons`[source]

discretize

`discretize`[source]

Parameters

Plotting helpers...
