Preparing portfolio tree
In Asset Management, big companies manage hundreds of portfolios. And they love to organize their portfolios into a tree structure. These can be organization by Asset Type, such as Public vs Private, Equity vs Fixed Income. These can be organization by teams, such as Equity Desk, FI Desk, Derivativees Desk etc.
Imagine you are a system provider for such an asset manager. The customer may send request to construct a tree in below manner. Here Node
s are referred to as portgroups. Portgroups may contain other portgroups or portfolios. Portfolio
is leaf level node and can not contain any other nodes.
Level1 | Level2 | Level3 |
---|---|---|
Node1 | ||
Node2 | ||
Portfolio1 | ||
Portfolio2 | ||
Node3 | ||
Portfolio3 | ||
Portfolio4 | ||
Portfolio5 |
In addition to above table, customer also sends a table that tells whether particular Node is portgroup or a portfolio.
Node ticker | Portfolio or Portgroup |
---|---|
Node1 | Portgroup |
Node2 | Portgroup |
Node3 | Portgroup |
Portfolio1 | Portfolio |
Portfolio2 | Portfolio |
Portfolio3 | Portfolio |
Portfolio4 | Portfolio |
Portfolio5 | Portfolio |
Suppose you have to insert this data into port_group
table that consists of parent
and child
columns. You would have to prepare beleow kind of dataframe to insert it into sql table.
Parent | Child |
---|---|
Node1 | Node2 |
Node1 | Node3 |
Node1 | Portfolio4 |
Node1 | Portfolio5 |
Node2 | Portfolio1 |
Node2 | Portfolio2 |
Node3 | Portfolio3 |
Node1 | Portfolio4 |
Node1 | Portfolio5 |
Somehow you will need to extract parent child relationship from table 1.
If each Level in table had parents defined in each cell, you could run below kind of function to define such parent child relationship.
df = pd.read_csv("table1.csv")
parent_child_list = []
for parent_col, child_col in zip(df.columns[:-1], df.columns[1:]):
pcdf = df[[parent_col,child_col]]
pcdf = pcdf.drop_duplicates()
pcdf = pcdf[~pcdf[child_col].isnull()]
pcdf.columns = ["parent","child"]
parent_child_list.append(pcdf)
parent_child_df = pd.concat(parent_child_list)
But for above function to work, our table 1 should look like below
Level1 | Level2 | Level3 |
---|---|---|
Node1 | ||
Node1 | Node2 | |
Node1 | Node2 | Portfolio1 |
Node1 | Node2 | Portfolio2 |
Node1 | Node3 | |
Node1 | Node3 | Portfolio3 |
Node1 | Portfolio4 | |
Node1 | Portfolio5 |
So how do we transform table 1 to above view.
We can use pandas
ffill()
function to forward fill NaN
values.
But that would also forward copy unwanted datapoints. For instance we wouldn't want to forward copy Portfolio2
into the cell below it, because if we did that then we would end up adding Portfolio2
into Node3
which is not something customer requested.
Here is our solution
# let's conduct forward fill
newtreedf = treedf.copy()
for col in treedf.columns:
newtreedf[col] = treedf[col].ffill()
# set funds that were ffilled and their whole pg path up to top node back to blank
# first treedf fillna with blank and convert to string
treedf[col] = treedf[col].fillna("")
treedf[col] = treedf[col].apply(str)
for i, row in newtreedf[[col]].iterrows():
if row[col] in funds:
if treedf.loc[i, col] == "":
newtreedf.loc[i, col] = ""
level_no = col[-1]
parent_cols = [f"Level{i}" for i in range(1, int(level_no))]
# iterate over parent levels in reverse order and set them to blank
# if it encounters a parent level value which is a fund stop the process
# because that is a separate pg path for another fund
for parent_col in reversed(parent_cols):
if newtreedf.loc[i, parent_col] not in funds:
newtreedf.loc[i, parent_col] = ""
else:
break
new_tree_filename = "new_tree_ffilled.xlsx"
newtreedf.to_excel(wfolder / new_tree_filename, index=False)
To understand the use of algorithm above let's describe how simple ffiled table will look like when we just ffilled last Level.
Level1 | Level2 | Level3 |
---|---|---|
Node1 | ||
Node1 | Node2 | |
Node1 | Node2 | Portfolio1 |
Node1 | Node2 | Portfolio2 |
Node1 | Node3 | Portfolio2 |
Node1 | Node3 | Portfolio3 |
Node1 | Portfolio4 | Portfolio3 |
Node1 | Portfolio5 | Portfolio3 |
In above we can clearly see the problem. The values highlighted in bold need to be removed.
And the algorithm to remove, or set back to blank the right values is described in the code above.
Let's assume we just ffilled last column Level3. Now we will iterate over every value in Level3 column. When we encounter *Portfolio2 first time we compare it against non-ffilled tree dataframe and will confirm that it wasn't blank there, so will leave it alone. When we encounter it second time, we will find out that in non-ffilled table its location is blank, so we will set it to blank. But we also need to set all parent portgroups in that row to blank too. For that we prepare parent_cols using list comprehension and then iterate over it in reverse order. We set values to blank in that row as long as it is a portgroup, because that parnt path is not needed anymore.
But if we stumble upon another portfolio when iterating over parent_cols in reverse order we stop the act of setting to blank. And that logic is clear when you look at last two rows of above table which has Portfolio3 forward filled twice. In this case we just need to set Portfolio3 to blank in the last two rows. Portfolio4 and Portfolio5 and their corresponding parents should be kept intact. And that's what the algorithm achieves with below code
...
for parent_col in reversed(parent_cols):
# parent is a portgroup so set it to blank
if newtreedf.loc[i, parent_col] not in funds:
newtreedf.loc[i, parent_col] = ""
else:
# parent is another fund with its corresponding parents, so stop the act of setting to blank
break
Now that we have accurately ffilled tree table, we can construct parent child table out of it easily using code we provided earlier.
Before setting up a tree, you might have to clean up the existing tree. For that you may benefit from below code that retrieves a tree from a table that stores parent child relationships. Below code calls get_underlying_portfolios
function recursively to get the whole tree.
def get_port_group_tree(pg, con):
"""
Get port group tree from specified top level port group
:param pg: port group name
:param con: ODBC connection
:return:
"""
# append a tuple of (<portfolio_group>,<portfolio_name>,<portfolio_code>,<depth_level>)
pg_df_list = []
depth_level = 0
def get_underlying_portfolios(portgroups, depth_level):
if len(portgroups) == 0:
return
query = f"SELECT * FROM portdb.dbo.port_group WHERE portfolio_group in ({generate_comma_separated_text(portgroups)}) and portfolio_name not in ({generate_comma_separated_text(portgroups)})"
pg_df = pd.read_sql(query, con)
pg_df = strip_string_columns(pg_df, ['portfolio_group', 'portfolio_name'])
pg_df['depth_level'] = depth_level
pg_df_list.append(pg_df)
get_underlying_portfolios(pg_df['portfolio_name'].unique(), depth_level + 1)
get_underlying_portfolios([pg], depth_level)
if len(pg_df_list) > 0:
pg_tree_df = pd.concat(pg_df_list)
pg_tree_df.index = range(len(pg_tree_df))
return pg_tree_df
Happy coding! Hope you enjoyed this small project :) I hope you will find it helpful in some of your own projects.