ourtils.wrangling

Data wrangling functions, typically for use with pandas.

class ourtils.wrangling.ColumnSpec(col_names: str | list[str], mapping: dict, use_numeric_order=False, cat_overrides=None)

Bases: object

Specification for an individual column.

apply_to_series(series: Series) → Series
as_category() → Categorical

class ourtils.wrangling.SpecCollection(specs: list[ColumnSpec])

Bases: object

A collection of ColumnSpec objects.

map_to_dataframe(data: DataFrame) → DataFrame

Applies all column specs to a dataframe.

ourtils.wrangling.collapse_multiindex(df: DataFrame, sep: str = '_') → DataFrame

Collapses a multi-index into a single level. A multi-index usually appears after some sort of aggregation.

Currently only supports an index nested one level deep (two levels in total).

Parameters:
  • df – The input dataframe

  • sep – A delimiter to use when joining the index values
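
As a rough illustration of the situation described above, the following plain-pandas sketch builds a two-level index through an aggregation and flattens it by joining the level values with a separator. The helper name flatten_columns is hypothetical and is not the ourtils implementation; the sketch also assumes the multi-index sits on the columns.

```python
import pandas as pd


def flatten_columns(df: pd.DataFrame, sep: str = "_") -> pd.DataFrame:
    """Hypothetical helper: join both levels of a column MultiIndex with sep."""
    out = df.copy()
    # Each column label is a tuple such as ("a", "min"); join its parts.
    out.columns = [sep.join(str(part) for part in col) for col in out.columns]
    return out


raw = pd.DataFrame({"g": ["x", "x", "y"], "a": [1, 2, 3]})
# Aggregating with several functions produces two-level column labels.
agg = raw.groupby("g").agg({"a": ["min", "max"]})
flat = flatten_columns(agg)
print(list(flat.columns))
```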

ourtils.wrangling.cols_with_n_distinct_values(df, n_unique, op_name: str = 'le') → Series

Shows the columns whose number of unique values meets a comparison against n_unique. The comparison is named by op_name (default 'le').
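
A minimal sketch of the idea in plain pandas, assuming that op_name refers to a comparison function in Python's operator module (for example 'le' for "less than or equal"). That reading is an assumption based on the default value, and columns_by_distinct_count is a hypothetical name, not the library function.

```python
import operator

import pandas as pd


def columns_by_distinct_count(df: pd.DataFrame, n_unique: int, op_name: str = "le") -> pd.Series:
    """Hypothetical helper: keep columns whose distinct count passes the comparison."""
    compare = getattr(operator, op_name)  # assumed meaning of op_name
    counts = df.nunique()                 # distinct values per column
    return counts[compare(counts, n_unique)]


df = pd.DataFrame(
    {"flag": [0, 1, 0, 1], "id": [1, 2, 3, 4], "kind": ["a", "a", "b", "c"]}
)
few = columns_by_distinct_count(df, 3)         # columns with at most 3 distinct values
many = columns_by_distinct_count(df, 3, "gt")  # columns with more than 3
```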

ourtils.wrangling.compute_distinct_values(dat: DataFrame) → DataFrame

Gets value counts of all non-numeric columns in a dataframe.
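
One way to picture this in plain pandas is sketched below. The output layout (one row per column/value pair, with columns named column, value and count) is an assumption made for illustration, and distinct_value_counts is a hypothetical name; the actual function may return a different shape.

```python
import pandas as pd


def distinct_value_counts(dat: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: stack value counts of every non-numeric column."""
    non_numeric = dat.select_dtypes(exclude="number")
    frames = []
    for col in non_numeric.columns:
        counts = non_numeric[col].value_counts()
        frames.append(
            pd.DataFrame(
                {"column": col, "value": counts.index, "count": counts.to_numpy()}
            )
        )
    if not frames:
        # No non-numeric columns: return an empty frame with the same layout.
        return pd.DataFrame(columns=["column", "value", "count"])
    return pd.concat(frames, ignore_index=True)


dat = pd.DataFrame({"n": [1, 2, 3], "c": ["a", "a", "b"]})
result = distinct_value_counts(dat)
```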

ourtils.wrangling.compute_pct_unique(series: Series) → float

Returns the percentage of unique values in a series. This function will attempt to convert its argument to a Series.
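
The calculation can be sketched in plain pandas as follows. Two assumptions are made here that the description does not confirm: the result is on a 0 to 100 scale, and missing values are ignored when counting distinct values (the pandas nunique default). The name pct_unique is hypothetical.

```python
import pandas as pd


def pct_unique(values) -> float:
    """Hypothetical helper: share of distinct values, on an assumed 0-100 scale."""
    series = pd.Series(values)  # accepts lists and other array-likes
    if len(series) == 0:
        return float("nan")
    return series.nunique() / len(series) * 100


share = pct_unique(["a", "a", "b", "c"])  # 3 distinct values out of 4
```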

ourtils.wrangling.create_column(df: DataFrame, colname: str, func: Callable, *args, **kwargs) → DataFrame

Creates a new column by applying a function to existing columns, which are referenced by name.

Parameters:
  • df – The input dataframe

  • colname – The name of the column you want to create

  • func – The function to apply to the columns

  • args – Names of the columns whose values are passed into func

Example:
In [1]: from ourtils.wrangling import create_column

In [2]: df = pd.DataFrame({
   ...:     'first': ['myfirst'],
   ...:     'last': ['mylast']
   ...: })
   ...: 

In [3]: def create_name(first: str, last: str) -> str:
   ...:     return f'{last}, {first}'
   ...: 

In [4]: df.pipe(create_column, 'mynewcolumn', create_name, 'first', 'last')
Out[4]: 
     first    last      mynewcolumn
0  myfirst  mylast  mylast, myfirst

ourtils.wrangling.crosstab(dat: DataFrame, group_by_vars, count_var) → DataFrame

Computes percentages and counts of count_var, grouped by group_by_vars.

Example:
crosstab(grouped, ['sex'], 'category')
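
For orientation, a comparable table can be produced in plain pandas as sketched below. The sketch assumes group_by_vars is a list, that percentages are computed within each group on a 0 to 100 scale, and that the output columns are called n and pct; none of these details are confirmed by the description, and counts_and_pct is a hypothetical name.

```python
import pandas as pd


def counts_and_pct(dat: pd.DataFrame, group_by_vars: list, count_var: str) -> pd.DataFrame:
    """Hypothetical helper: counts of count_var and their share within each group."""
    counts = (
        dat.groupby(group_by_vars + [count_var])
        .size()
        .reset_index(name="n")
    )
    # Divide each count by the total of its group (assumed 0-100 scale).
    group_totals = counts.groupby(group_by_vars)["n"].transform("sum")
    counts["pct"] = counts["n"] / group_totals * 100
    return counts


people = pd.DataFrame(
    {"sex": ["f", "f", "f", "m"], "category": ["a", "a", "b", "a"]}
)
table = counts_and_pct(people, ["sex"], "category")
```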

ourtils.wrangling.filter_cols(data, n_distinct_thresh=3, selected_dtypes=None, applicator='both') → DataFrame

Filters a dataframe based on distinct counts, or using select_include.

ourtils.wrangling.filter_random(df: DataFrame, col: str) → DataFrame

Returns the rows of the dataframe where col equals one randomly chosen value of col.

Parameters:
  • df – The input dataframe

  • col – The column to pick a random value from
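
A minimal sketch of the behaviour described above, written in plain pandas. The function name keep_random_value and the seed parameter are hypothetical additions for reproducibility; the sketch also assumes that missing values are never chosen.

```python
import random

import pandas as pd


def keep_random_value(df: pd.DataFrame, col: str, seed=None) -> pd.DataFrame:
    """Hypothetical helper: keep only rows matching one randomly picked value of col."""
    rng = random.Random(seed)
    candidates = list(df[col].dropna().unique())  # assumed: NaN is never picked
    chosen = rng.choice(candidates)
    return df[df[col] == chosen]


df = pd.DataFrame({"team": ["a", "b", "a", "c"], "score": [1, 2, 3, 4]})
subset = keep_random_value(df, "team", seed=0)
```

This kind of filter is handy for spot-checking a single group (one customer, one site, one day) without hard-coding which one.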

ourtils.wrangling.send_column_to(dat: DataFrame, move_cols: list[str] | str, send_to: Literal['front', 'back'] = 'front') → DataFrame

Moves one or more columns to the front or back of a dataframe.
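
The reordering can be pictured with the plain-pandas sketch below. The name reorder_columns is hypothetical and this is not the library's code; it only assumes that the columns not being moved keep their relative order.

```python
import pandas as pd


def reorder_columns(dat: pd.DataFrame, move_cols, send_to: str = "front") -> pd.DataFrame:
    """Hypothetical helper: move the given columns to the front or back."""
    moved = [move_cols] if isinstance(move_cols, str) else list(move_cols)
    rest = [c for c in dat.columns if c not in moved]  # untouched columns, original order
    order = moved + rest if send_to == "front" else rest + moved
    return dat[order]


df = pd.DataFrame({"a": [1], "b": [2], "c": [3]})
front = reorder_columns(df, "c")                 # single column name as a string
back = reorder_columns(df, ["a", "b"], "back")   # several columns at once
```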

ourtils.wrangling.shout(df: DataFrame, msg: str = None) → DataFrame

A simple function to be used with DataFrame.pipe to print out the shape of a dataframe and an optional message.

Parameters:
  • df – The input dataframe

  • msg – The message you want to print

Returns:

The original dataframe

Example:
In [1]: from ourtils.wrangling import shout

In [2]: output = (
   ...:     pd.DataFrame([{'a': 10}, {'a': 15}, {'a': 20}])
   ...:     .pipe(shout, 'Starting pipeline')
   ...:     .loc[lambda x: x['a'] >= 15]
   ...:     .pipe(shout, 'After filtering')
   ...: ); output
   ...: 
(3, 1): Starting pipeline
(2, 1): After filtering
Out[2]: 
    a
1  15
2  20

ourtils.wrangling.sort_col_manually(input_df: DataFrame, col_name: str, ordered_values: list)

Sorts a column in a dataframe according to a manually specified order of values.

Example:
sort_col_manually(my_data, 'variable', ['first', 'second', 'third'])
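
A manual sort order can be sketched in plain pandas by ranking each value by its position in the list. The name sort_by_manual_order is hypothetical and the library may work differently (for example through an ordered Categorical). In this sketch, values missing from ordered_values are placed last.

```python
import pandas as pd


def sort_by_manual_order(input_df: pd.DataFrame, col_name: str, ordered_values: list) -> pd.DataFrame:
    """Hypothetical helper: sort rows by the position of col_name's value in ordered_values."""
    rank = {value: position for position, value in enumerate(ordered_values)}
    # Unlisted values map to NaN, which sort_values places last by default.
    return input_df.sort_values(col_name, key=lambda s: s.map(rank), kind="stable")


my_data = pd.DataFrame(
    {"variable": ["third", "first", "second", "first"], "x": [1, 2, 3, 4]}
)
ordered = sort_by_manual_order(my_data, "variable", ["first", "second", "third"])
```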

ourtils.wrangling.squish(df: DataFrame, index_var: str | list[str], col_sep: str = '_', agg_func: Callable = list) → DataFrame

Reshapes wide data into long format and adds a “group” column.

Parameters:
  • df – The input dataframe

  • index_var – The column or columns that uniquely identify each row

  • col_sep – The separator to split the column names on

  • agg_func – The function to use to aggregate the values. Defaults to a simple list

Example:
In [1]: import pandas as pd

In [2]: from ourtils.wrangling import squish

In [3]: df = pd.DataFrame(
   ...:     columns=['index_var', 'a_1', 'a_2', 'b_1', 'b_2', 'b_3'],
   ...:     data=[
   ...:         (1, 2, 3, 4, 5, 6),
   ...:         (10, 20, 30, 40, 50, 60)
   ...:     ]
   ...: )
   ...: 

In [4]: df
Out[4]: 
   index_var  a_1  a_2  b_1  b_2  b_3
0          1    2    3    4    5    6
1         10   20   30   40   50   60

In [5]: df.pipe(squish, 'index_var')
Out[5]: 
   index_var group         value
0          1     a        [2, 3]
1          1     b     [4, 5, 6]
2         10     a      [20, 30]
3         10     b  [40, 50, 60]
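
The reshaping in the example above can be approximated with melt and groupby in plain pandas. This is a sketch under stated assumptions, not the ourtils implementation: squish_sketch is a hypothetical name, and the group is taken to be the part of each column name before the first separator.

```python
import pandas as pd


def squish_sketch(df: pd.DataFrame, index_var, col_sep: str = "_", agg_func=list) -> pd.DataFrame:
    """Hypothetical helper: wide-to-long reshape with one aggregated row per group."""
    index_cols = [index_var] if isinstance(index_var, str) else list(index_var)
    long = df.melt(id_vars=index_cols, var_name="column", value_name="value")
    # "a_1" -> "a": assumed rule for deriving the group from the column name.
    long["group"] = long["column"].str.split(col_sep).str[0]
    return (
        long.groupby(index_cols + ["group"])["value"]
        .agg(agg_func)
        .reset_index()
    )


wide = pd.DataFrame(
    columns=["index_var", "a_1", "a_2", "b_1", "b_2", "b_3"],
    data=[(1, 2, 3, 4, 5, 6), (10, 20, 30, 40, 50, 60)],
)
squished = squish_sketch(wide, "index_var")
```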