ourtils.wrangling
Data wrangling functions, typically with pandas
- class ourtils.wrangling.ColumnSpec(col_names: str | list[str], mapping: dict, use_numeric_order=False, cat_overrides=None)
Bases: object
Specification for an individual column.
- apply_to_series(series: Series) -> Series
- as_category() -> Categorical
- class ourtils.wrangling.SpecCollection(specs: list[ColumnSpec])
Bases: object
A collection of column specs.
- map_to_dataframe(data: DataFrame) -> DataFrame
Applies all column specs to a dataframe.
- ourtils.wrangling.collapse_multiindex(df: DataFrame, sep: str = '_') -> DataFrame
Collapses a multi-index into a single level. A multi-index usually appears after some sort of aggregation.
Currently only supports a two-level index (one level of nesting).
- Parameters:
df – The input dataframe
sep – A delimiter to use when joining the index values
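A minimal sketch of the idea in plain pandas, not the library's implementation. It assumes the multi-index sits on the columns (the usual result of a multi-function aggregation); `collapse_columns_sketch` is a hypothetical stand-in name.

```python
import pandas as pd


def collapse_columns_sketch(df: pd.DataFrame, sep: str = "_") -> pd.DataFrame:
    """Hypothetical stand-in: join each multi-level column label into one string."""
    out = df.copy()
    # Each label of a MultiIndex is a tuple such as ('v', 'min').
    out.columns = [sep.join(str(part) for part in col) for col in out.columns]
    return out


raw = pd.DataFrame({"g": ["x", "x", "y"], "v": [1, 2, 3]})
# Aggregating with several functions yields two-level column labels.
agg = raw.groupby("g").agg({"v": ["min", "max"]})
flat = collapse_columns_sketch(agg)
print(list(flat.columns))
```

With the default separator, the two-level labels `('v', 'min')` and `('v', 'max')` become `v_min` and `v_max`.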
- ourtils.wrangling.cols_with_n_distinct_values(df, n_unique, op_name: str = 'le') -> Series
Shows columns with a certain number of unique values
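The sketch below shows one way such a filter could work in plain pandas. It assumes that op_name names a comparison in Python's `operator` module (so `'le'` means "at most n_unique distinct values"); that reading and the name `cols_by_nunique_sketch` are assumptions, not confirmed by the documentation.

```python
import operator

import pandas as pd


def cols_by_nunique_sketch(df: pd.DataFrame, n_unique: int, op_name: str = "le") -> pd.Series:
    """Hypothetical stand-in: keep columns whose distinct count passes the comparison."""
    compare = getattr(operator, op_name)  # e.g. operator.le, operator.eq, operator.gt
    counts = df.nunique()                 # distinct non-null values per column
    return counts[compare(counts, n_unique)]


demo = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "y", "z"], "c": [0, 0, 0]})
few_values = cols_by_nunique_sketch(demo, 2)
print(few_values)
```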
- ourtils.wrangling.compute_distinct_values(dat: DataFrame) -> DataFrame
Gets value counts of all non-numeric columns in a dataframe.
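As a rough illustration in plain pandas, value counts for every non-numeric column can be stacked into one long table. The output layout (`column`, `value`, `count`) and the name `distinct_values_sketch` are assumptions; the shape returned by the real function is not documented here.

```python
import pandas as pd


def distinct_values_sketch(dat: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: one row per (column, value) pair with its count."""
    non_numeric = dat.select_dtypes(exclude="number")
    frames = []
    for name in non_numeric.columns:
        counts = non_numeric[name].value_counts()
        frames.append(
            pd.DataFrame(
                {"column": name, "value": counts.index, "count": counts.to_numpy()}
            )
        )
    if not frames:  # no non-numeric columns at all
        return pd.DataFrame(columns=["column", "value", "count"])
    return pd.concat(frames, ignore_index=True)


people = pd.DataFrame({"sex": ["f", "m", "f"], "age": [31, 45, 27]})
summary = distinct_values_sketch(people)
print(summary)
```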
- ourtils.wrangling.compute_pct_unique(series: Series) -> float
Returns the percentage of unique values in a series. The function attempts to convert the series argument to a pandas Series.
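A minimal sketch of the calculation, assuming the result is a fraction between 0 and 1 and that missing values are not counted as a distinct value (the pandas `nunique` default). Both points, and the name `pct_unique_sketch`, are assumptions rather than documented behavior.

```python
import pandas as pd


def pct_unique_sketch(values) -> float:
    """Hypothetical stand-in: share of distinct values among all entries."""
    series = pd.Series(values)  # accepts lists, arrays, or an existing Series
    if len(series) == 0:
        return 0.0              # avoid dividing by zero on empty input
    return series.nunique() / len(series)


share = pct_unique_sketch([1, 1, 2, 3])
print(share)
```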
- ourtils.wrangling.create_column(df: DataFrame, colname: str, func: Callable, *args, **kwargs) -> DataFrame
Creates a new column using a function that takes column names as strings.
- Parameters:
df – The input dataframe
colname – The name of the column you want to create
func – The function to apply to the columns
args – Column names to pass into func
- Example:
In [1]: import pandas as pd
   ...: from ourtils.wrangling import create_column

In [2]: df = pd.DataFrame({
   ...:     'first': ['myfirst'],
   ...:     'last': ['mylast']
   ...: })
   ...: 

In [3]: def create_name(first: str, last: str) -> str:
   ...:     return f'{last}, {first}'
   ...: 

In [4]: df.pipe(create_column, 'mynewcolumn', create_name, 'first', 'last')
Out[4]: 
     first    last      mynewcolumn
0  myfirst  mylast  mylast, myfirst
- ourtils.wrangling.crosstab(dat: DataFrame, group_by_vars, count_var) -> DataFrame
Computes percentages and counts of count_var by group_by_vars.
- Example:
crosstab(grouped, ['sex'], 'category')
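To make the counts-and-percentages idea concrete, here is a sketch in plain pandas. It is not the library's implementation: the output column names `n` and `pct`, the choice to compute percentages within each group, and the name `crosstab_sketch` are all assumptions.

```python
import pandas as pd


def crosstab_sketch(dat: pd.DataFrame, group_by_vars: list, count_var: str) -> pd.DataFrame:
    """Hypothetical stand-in: count count_var within each group and add a share."""
    counts = (
        dat.groupby([*group_by_vars, count_var])
        .size()
        .reset_index(name="n")
    )
    # Share of each count_var value within its group.
    counts["pct"] = counts["n"] / counts.groupby(group_by_vars)["n"].transform("sum")
    return counts


grouped = pd.DataFrame(
    {
        "sex": ["f", "f", "m", "m", "m"],
        "category": ["a", "b", "a", "a", "b"],
    }
)
table = crosstab_sketch(grouped, ["sex"], "category")
print(table)
```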
- ourtils.wrangling.filter_cols(data, n_distinct_thresh=3, selected_dtypes=None, applicator='both') -> DataFrame
Filters a dataframe based on distinct counts, or using select_include.
- ourtils.wrangling.filter_random(df: DataFrame, col: str) -> DataFrame
Returns the dataframe filtered to a random value of col.
- Parameters:
df – The input dataframe
col – The column to pick a random value from
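A sketch of the same idea in plain pandas, assuming the random value is drawn uniformly from the distinct non-null values of col. The `seed` parameter and the name `filter_random_sketch` are hypothetical additions for reproducibility and are not part of the documented signature.

```python
import pandas as pd


def filter_random_sketch(df: pd.DataFrame, col: str, seed=None) -> pd.DataFrame:
    """Hypothetical stand-in: keep only the rows matching one randomly chosen value."""
    chosen = (
        df[col]
        .dropna()
        .drop_duplicates()
        .sample(n=1, random_state=seed)
        .iloc[0]
    )
    return df[df[col] == chosen]


orders = pd.DataFrame({"customer": ["a", "a", "b"], "amount": [10, 20, 30]})
one_customer = filter_random_sketch(orders, "customer", seed=0)
print(one_customer)
```

This is handy for spot-checking a single group, such as one customer or one ID, in the middle of a pipeline.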
- ourtils.wrangling.send_column_to(dat: DataFrame, move_cols: list[str] | str, send_to: Literal['front', 'back'] = 'front') -> DataFrame
Sends a column / columns to the front / back of a dataframe
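The reordering can be sketched in plain pandas as follows. This is an illustration of the described behavior under the documented signature, not the library's code; `send_column_to_sketch` is a hypothetical name.

```python
import pandas as pd


def send_column_to_sketch(dat: pd.DataFrame, move_cols, send_to: str = "front") -> pd.DataFrame:
    """Hypothetical stand-in: move the named columns to the front or back."""
    if isinstance(move_cols, str):
        move_cols = [move_cols]  # accept a single column name
    rest = [c for c in dat.columns if c not in move_cols]
    order = move_cols + rest if send_to == "front" else rest + move_cols
    return dat[order]


wide = pd.DataFrame({"a": [1], "b": [2], "c": [3]})
front = send_column_to_sketch(wide, "c")
back = send_column_to_sketch(wide, ["a"], send_to="back")
print(list(front.columns), list(back.columns))
```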
- ourtils.wrangling.shout(df: DataFrame, msg: str = None) -> DataFrame
A simple function to be used with DataFrame.pipe to print out the size of a dataframe and an optional message.
- Parameters:
df – The input dataframe
msg – The message you want to print
- Returns:
The original dataframe
- Example:
In [1]: import pandas as pd
   ...: from ourtils.wrangling import shout

In [2]: output = (
   ...:     pd.DataFrame([{'a': 10}, {'a': 15}, {'a': 20}])
   ...:     .pipe(shout, 'Starting pipeline')
   ...:     .loc[lambda x: x['a'] >= 15]
   ...:     .pipe(shout, 'After filtering')
   ...: ); output
   ...: 
(3, 1): Starting pipeline
(2, 1): After filtering
Out[2]: 
    a
1  15
2  20
- ourtils.wrangling.sort_col_manually(input_df: DataFrame, col_name: str, ordered_values: list)
Sorts a column in a dataframe in a manual order.
- Example:
sort_col_manually(my_data, 'variable', ['first', 'second', 'third'])
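One way to get a manual sort order in plain pandas is to map each value to its position in the desired list and sort on that position. The sketch below assumes this approach; the real function may instead use an ordered categorical, and `sort_col_manually_sketch` is a hypothetical name.

```python
import pandas as pd


def sort_col_manually_sketch(input_df: pd.DataFrame, col_name: str, ordered_values: list) -> pd.DataFrame:
    """Hypothetical stand-in: sort rows by the position of each value in ordered_values."""
    position = {value: i for i, value in enumerate(ordered_values)}
    # Values missing from ordered_values map to NaN and are sorted last.
    return input_df.sort_values(col_name, key=lambda s: s.map(position), kind="stable")


my_data = pd.DataFrame({"variable": ["third", "first", "second"], "n": [3, 1, 2]})
ordered = sort_col_manually_sketch(my_data, "variable", ["first", "second", "third"])
print(ordered)
```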
- ourtils.wrangling.squish(df: DataFrame, index_var: str | list[str], col_sep: str = '_', agg_func: Callable = list) -> DataFrame
Reshapes wide data into long format and adds a “group” column.
- Parameters:
df – The input dataframe
index_var – The column or columns that uniquely identify each row
col_sep – The separator to split the column names on
agg_func – The function to use to aggregate the values. Defaults to a simple list
- Example:
In [1]: import pandas as pd

In [2]: from ourtils.wrangling import squish

In [3]: df = pd.DataFrame(
   ...:     columns=['index_var', 'a_1', 'a_2', 'b_1', 'b_2', 'b_3'],
   ...:     data=[
   ...:         (1, 2, 3, 4, 5, 6),
   ...:         (10, 20, 30, 40, 50, 60)
   ...:     ]
   ...: )
   ...: 

In [4]: df
Out[4]: 
   index_var  a_1  a_2  b_1  b_2  b_3
0          1    2    3    4    5    6
1         10   20   30   40   50   60

In [5]: df.pipe(squish, 'index_var')
Out[5]: 
   index_var group         value
0          1     a        [2, 3]
1          1     b     [4, 5, 6]
2         10     a      [20, 30]
3         10     b  [40, 50, 60]