Feature Engineering for Alpha Factor Research

Alpha factors are mathematical expressions or models that aim to quantify the skill of an investment strategy in generating excess returns beyond what would be expected given the associated risks i.e. it represents the active return of a portfolio.

Here is the data that we are going to use:

The wiki prices is a NASDAQ dataset that has stock prices, dividends, and splits for 3000 US publicly-traded companies wiki prices data

The US Equities Meta-Data can be found on my github. Beware it is a large file ~1.8 GB. us equities meta data

Imports

%pip install pandas_datareader
%pip install statsmodels

import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import pandas_datareader.data as web

import seaborn as sns

from statsmodels.regression.rolling import RollingOLS

import statsmodels.api as sm

from datetime import datetime

START = 2000
END = 2018

idx = pd.IndexSlice

Read and Store Data

# parse prices data 

df_prices = (pd.read_csv('WIKI_PRICES.csv',
                 parse_dates = ['date'],
                 index_col = ['date','ticker'],
                 infer_datetime_format=True)).sort_index()

df_screener = pd.read_csv('us_equities_meta_data.csv')

# store price and stock data 

with pd.HDFStore('assets.h5') as store:
    store.put('data_prices',df_prices)
    store.put('data_screener',df_screener)

# read in prices and stocks 
'''
get all the dates between START and END and 
then get the adj_close column 
reshape df by turning the tickers into the columns and have the adj value for every date 
we convert the rows into columns so that we can easily access the tickers later on 
'''


with pd.HDFStore('assets.h5') as store:
    prices = (store['data_prices']
              .loc[idx[str(START):str(END), :], 'adj_close']
              .unstack('ticker'))
    stocks = store['data_screener'].loc[:, ['ticker','marketcap', 'ipoyear', 'sector']]

prices.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 4706 entries, 2000-01-03 to 2018-03-27
Columns: 3199 entries, A to ZUMZ
dtypes: float64(3199)
memory usage: 114.9 MB

stocks.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6834 entries, 0 to 6833
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   ticker     6834 non-null   object 
 1   marketcap  5766 non-null   float64
 2   ipoyear    3038 non-null   float64
 3   sector     5288 non-null   object 
dtypes: float64(2), object(2)
memory usage: 267.0+ KB

Data Cleaning

remove stock duplicates and reset the index name for later data manipulation

stocks = stocks[~stocks.index.duplicated()]
stocks = stocks.set_index('ticker')

get all of the common tickers with their price information and metadata

shared = prices.columns.intersection(stocks.index)

stocks = stocks.loc[shared, :]
stocks.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2412 entries, A to ZUMZ
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   marketcap  2407 non-null   float64
 1   ipoyear    1065 non-null   float64
 2   sector     2372 non-null   object 
dtypes: float64(2), object(1)
memory usage: 75.4+ KB

prices = prices.loc[:, shared]
prices.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 4706 entries, 2000-01-03 to 2018-03-27
Columns: 2412 entries, A to ZUMZ
dtypes: float64(2412)
memory usage: 86.6 MB

assert prices.shape[1] == stocks.shape[0]

Create a Monthly Return Series

monthly_prices = prices.resample('M').last()

monthly_prices.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 219 entries, 2000-01-31 to 2018-03-31
Freq: M
Columns: 2412 entries, A to ZUMZ
dtypes: float64(2412)
memory usage: 4.0 MB

Here, we calculate the precentage change of monthly prices over a lag period. This represents the return over the specified number of months.

Clip outliers in the data based on the outlier_cutoff quantile and the 1- outlier_cutoff to their respective quantile values. Add 1 to the clipped values. This is to handle returns and convert percentage changes back to absolute returns.

Raise each value to the power of \(\frac{1}{lag}\). This is used to annualize returns when the lag is representative of months.

outlier_cutoff = 0.01
data = pd.DataFrame()
lags = [1, 2, 3, 6, 9, 12]
for lag in lags:
    data[f'return_{lag}m'] = (monthly_prices
                           .pct_change(lag) 
                           .stack()         
                           .pipe(lambda x: x.clip(lower=x.quantile(outlier_cutoff),
                                                  upper=x.quantile(1-outlier_cutoff))) 
                           .add(1)
                           .pow(1/lag)
                           .sub(1)
                           )
data = data.swaplevel().dropna()
data.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 399525 entries, ('A', Timestamp('2001-01-31 00:00:00', freq='M')) to ('ZUMZ', Timestamp('2018-03-31 00:00:00', freq='M'))
Data columns (total 6 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   return_1m   399525 non-null  float64
 1   return_2m   399525 non-null  float64
 2   return_3m   399525 non-null  float64
 3   return_6m   399525 non-null  float64
 4   return_9m   399525 non-null  float64
 5   return_12m  399525 non-null  float64
dtypes: float64(6)
memory usage: 19.9+ MB

Drop Stocks with Less than 10 Years of Returns

min_obs = 120

# get the number of observations for each ticker 
nobs = data.groupby(level='ticker').size()

# get the indices of the tickers where the num of observations is greater than min_obs
keep = nobs[nobs>min_obs].index

# get the data with appropriate tickers and their respective data 
data = data.loc[idx[keep,:], :]
data.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 360752 entries, ('A', Timestamp('2001-01-31 00:00:00', freq='M')) to ('ZUMZ', Timestamp('2018-03-31 00:00:00', freq='M'))
Data columns (total 6 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   return_1m   360752 non-null  float64
 1   return_2m   360752 non-null  float64
 2   return_3m   360752 non-null  float64
 3   return_6m   360752 non-null  float64
 4   return_9m   360752 non-null  float64
 5   return_12m  360752 non-null  float64
dtypes: float64(6)
memory usage: 18.0+ MB

data.describe()

	return_1m	return_2m	return_3m	return_6m	return_9m	return_12m
count	360752.000000	360752.000000	360752.000000	360752.000000	360752.000000	360752.000000
mean	0.012255	0.009213	0.008181	0.007025	0.006552	0.006296
std	0.114236	0.081170	0.066584	0.048474	0.039897	0.034792
min	-0.329564	-0.255452	-0.214783	-0.162063	-0.131996	-0.114283
25%	-0.046464	-0.030716	-0.023961	-0.014922	-0.011182	-0.009064
50%	0.009448	0.009748	0.009744	0.009378	0.008982	0.008726
75%	0.066000	0.049249	0.042069	0.031971	0.027183	0.024615
max	0.430943	0.281819	0.221789	0.154555	0.124718	0.106371

sns.clustermap(data.corr('spearman'), annot=True, center=0, cmap='Blues');

data.index.get_level_values('ticker').nunique()

Rolling Factor Betas

In this section we are going to estimate the factor exposure of the listed stocks in the database according to the Fama-French Five-Factor Model.

Mkt-Rf is the market risk premium \((R_m - R_f)\). The difference between these two factors represents the additional return investors demand for bearing the systemic risk associated with the market.

SMB is small minus big. This factor represents the spread between small-cap and large-cap stocks.

SMB is typically measured in terms of market capitalization which is represented as \(MarketCapitalization = currentStockPrice * TotalNumberOfOutstandingShares\)

HML is high minus low, which is the spread between high book-to-market and low book-to-market stocks. The book-to-market ratio is \(\frac{shareholder'sEquity}{marketCapitalization}\).

RMW is robust minus weak. This compares the returns of firms with higher operating profitability and those with weak operating probitability

CMA is conservative minus aggressive and it gauges the difference between companies that invest aggressively and those that do so more conservatively.

factors = ['Mkt-RF', 'SMB', 'HML', 'RMW', 'CMA']
factor_data = web.DataReader('F-F_Research_Data_5_Factors_2x3', 'famafrench', start='2000')[0].drop('RF', axis=1)
factor_data.index = factor_data.index.to_timestamp()
factor_data = factor_data.resample('M').last().div(100)
factor_data.index.name = 'date'
factor_data.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 289 entries, 2000-01-31 to 2024-01-31
Freq: M
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Mkt-RF  289 non-null    float64
 1   SMB     289 non-null    float64
 2   HML     289 non-null    float64
 3   RMW     289 non-null    float64
 4   CMA     289 non-null    float64
dtypes: float64(5)
memory usage: 13.5 KB

factor_data = factor_data.join(data['return_1m']).sort_index()
factor_data.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 360752 entries, ('A', Timestamp('2001-01-31 00:00:00', freq='M')) to ('ZUMZ', Timestamp('2018-03-31 00:00:00', freq='M'))
Data columns (total 6 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   Mkt-RF     360752 non-null  float64
 1   SMB        360752 non-null  float64
 2   HML        360752 non-null  float64
 3   RMW        360752 non-null  float64
 4   CMA        360752 non-null  float64
 5   return_1m  360752 non-null  float64
dtypes: float64(6)
memory usage: 18.0+ MB

Unlike traditional linear regression that considers an entire dataset, rolling linear regression calculates regresion coefficients and other statistics for a specified window of consecutive data points and then moves a window forward one observation at a time. Rolling Linear Regression is helpful when delaing with non-stationary time series data.

T = 24
betas = (factor_data.groupby(level='ticker',
                             group_keys=False)
         .apply(lambda x: RollingOLS(endog=x.return_1m,
                                     exog=sm.add_constant(x.drop('return_1m', axis=1)),
                                     window=min(T, x.shape[0]-1))
                .fit(params_only=True)
                .params
                .drop('const', axis=1)))

betas.describe().join(betas.sum(1).describe().to_frame('total'))

	Mkt-RF	SMB	HML	RMW	CMA	total
count	318478.000000	318478.000000	318478.000000	318478.000000	318478.000000	360752.000000
mean	0.979365	0.626588	0.122610	-0.062073	0.016754	1.485997
std	0.918116	1.254249	1.603524	1.908446	2.158982	3.306487
min	-9.805604	-10.407516	-15.382504	-23.159702	-18.406854	-33.499590
25%	0.463725	-0.118767	-0.707780	-0.973586	-1.071697	0.000000
50%	0.928902	0.541623	0.095292	0.037585	0.040641	1.213499
75%	1.444882	1.304325	0.946760	0.950267	1.135600	3.147199
max	10.855709	10.297453	15.038572	17.079472	16.671709	34.259432

cmap = sns.diverging_palette(10,220,as_cmap=True)
sns.clustermap(betas.corr(),annot=True,cmap=cmap,center=0);

data = (data.join(betas.groupby(level='ticker').shift()))
data.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 360752 entries, ('A', Timestamp('2001-01-31 00:00:00', freq='M')) to ('ZUMZ', Timestamp('2018-03-31 00:00:00', freq='M'))
Data columns (total 11 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   return_1m   360752 non-null  float64
 1   return_2m   360752 non-null  float64
 2   return_3m   360752 non-null  float64
 3   return_6m   360752 non-null  float64
 4   return_9m   360752 non-null  float64
 5   return_12m  360752 non-null  float64
 6   Mkt-RF      316640 non-null  float64
 7   SMB         316640 non-null  float64
 8   HML         316640 non-null  float64
 9   RMW         316640 non-null  float64
 10  CMA         316640 non-null  float64
dtypes: float64(11)
memory usage: 39.8+ MB

Impute Mean for Missing Factor Betas

data.loc[:, factors] = data.groupby('ticker')[factors].apply(lambda x: x.fillna(x.mean()))
data.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 360752 entries, ('A', Timestamp('2001-01-31 00:00:00', freq='M')) to ('ZUMZ', Timestamp('2018-03-31 00:00:00', freq='M'))
Data columns (total 11 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   return_1m   360752 non-null  float64
 1   return_2m   360752 non-null  float64
 2   return_3m   360752 non-null  float64
 3   return_6m   360752 non-null  float64
 4   return_9m   360752 non-null  float64
 5   return_12m  360752 non-null  float64
 6   Mkt-RF      360752 non-null  float64
 7   SMB         360752 non-null  float64
 8   HML         360752 non-null  float64
 9   RMW         360752 non-null  float64
 10  CMA         360752 non-null  float64
dtypes: float64(11)
memory usage: 39.8+ MB

Momentum Factors

Momentum represents the speed or velocity in which prices change in a publicly traded security. Momentum is caluclated by taking the return of the equal weighted average of the 30% highest performing stocks minus the return of the equal weighted average of the 30% lowest performing stocks.

Here, we create multiple momentum factors based on different lag periods and an additional momentum factor based on the difference between the 12 month and 3 month returns.

for lag in [2,3,6,9,12]:
    # for each value in the loop, calculate new column
    # momentum is computed as the difference between the return for that lag period 
    # and the reutrn for the most recent month 
    data[f'momentum_{lag}'] = data[f'return_{lag}m'].sub(data.return_1m)
    
data[f'momentum_3_12'] = data[f'return_12m'].sub(data.return_3m)

Date Indicators

# date indicators 

dates = data.index.get_level_values('date')
data['year'] = dates.year 
data['month'] = dates.month

Lagged Returns

# to use lagged values as input variables or features associated with the current observations 
# we use the shift function to move historical returns up to the current period 

for t in range(1,7):
    data[f'return_1m_t-{t}'] = data.groupby(level='ticker').return_1m.shift(t)

data.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 360752 entries, ('A', Timestamp('2001-01-31 00:00:00', freq='M')) to ('ZUMZ', Timestamp('2018-03-31 00:00:00', freq='M'))
Data columns (total 25 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 return_1m      360752 non-null  float64
 return_2m      360752 non-null  float64
 return_3m      360752 non-null  float64
 return_6m      360752 non-null  float64
 return_9m      360752 non-null  float64
 return_12m     360752 non-null  float64
 Mkt-RF         360752 non-null  float64
 SMB            360752 non-null  float64
 HML            360752 non-null  float64
 RMW            360752 non-null  float64
CMA            360752 non-null  float64
momentum_2     360752 non-null  float64
momentum_3     360752 non-null  float64
momentum_6     360752 non-null  float64
momentum_9     360752 non-null  float64
momentum_12    360752 non-null  float64
momentum_3_12  360752 non-null  float64
year           360752 non-null  int64  
month          360752 non-null  int64  
return_1m_t-1  358914 non-null  float64
return_1m_t-2  357076 non-null  float64
return_1m_t-3  355238 non-null  float64
return_1m_t-4  353400 non-null  float64
return_1m_t-5  351562 non-null  float64
return_1m_t-6  349724 non-null  float64
dtypes: float64(23), int64(2)
memory usage: 78.3+ MB

Target: Holding Period Returns

Holding period returns are the total return an investor earns or loses from holding an investment over a specific period of time. \(HPR = \frac{(endingValue - beginningValue + income)}{beginningValue}*100\%\)

Use the normalized period returns computed previously and shift them back to align them with the current financial features

for t in [1,2,3,6,12]:
    data[f'target_{t}m'] = data.groupby(level='ticker')[f'return_{t}m'].shift(-t)

cols = ['target_1m',
        'target_2m',
        'target_3m', 
        'return_1m',
        'return_2m',
        'return_3m',
        'return_1m_t-1',
        'return_1m_t-2',
        'return_1m_t-3']

data[cols].dropna().sort_index().head(10)

		target_1m	target_2m	target_3m	return_1m	return_2m	return_3m	return_1m_t-1	return_1m_t-2	return_1m_t-3
ticker	date
A	2001-04-30	-0.140220	-0.087246	-0.098192	0.269444	0.040966	-0.105747	-0.146389	-0.329564	-0.003653
	2001-05-31	-0.031008	-0.076414	-0.075527	-0.140220	0.044721	-0.023317	0.269444	-0.146389	-0.329564
	2001-06-30	-0.119692	-0.097014	-0.155847	-0.031008	-0.087246	0.018842	-0.140220	0.269444	-0.146389
	2001-07-31	-0.073750	-0.173364	-0.080114	-0.119692	-0.076414	-0.098192	-0.031008	-0.140220	0.269444
	2001-08-31	-0.262264	-0.083279	0.009593	-0.073750	-0.097014	-0.075527	-0.119692	-0.031008	-0.140220
	2001-09-30	0.139130	0.181052	0.134010	-0.262264	-0.173364	-0.155847	-0.073750	-0.119692	-0.031008
	2001-10-31	0.224517	0.131458	0.108697	0.139130	-0.083279	-0.080114	-0.262264	-0.073750	-0.119692
	2001-11-30	0.045471	0.054962	0.045340	0.224517	0.181052	0.009593	0.139130	-0.262264	-0.073750
	2001-12-31	0.064539	0.045275	0.070347	0.045471	0.131458	0.134010	0.224517	0.139130	-0.262264
	2002-01-31	0.026359	0.073264	-0.003306	0.064539	0.054962	0.108697	0.045471	0.224517	0.139130

data.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 360752 entries, ('A', Timestamp('2001-01-31 00:00:00', freq='M')) to ('ZUMZ', Timestamp('2018-03-31 00:00:00', freq='M'))
Data columns (total 30 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 return_1m      360752 non-null  float64
 return_2m      360752 non-null  float64
 return_3m      360752 non-null  float64
 return_6m      360752 non-null  float64
 return_9m      360752 non-null  float64
 return_12m     360752 non-null  float64
 Mkt-RF         360752 non-null  float64
 SMB            360752 non-null  float64
 HML            360752 non-null  float64
 RMW            360752 non-null  float64
CMA            360752 non-null  float64
momentum_2     360752 non-null  float64
momentum_3     360752 non-null  float64
momentum_6     360752 non-null  float64
momentum_9     360752 non-null  float64
momentum_12    360752 non-null  float64
momentum_3_12  360752 non-null  float64
year           360752 non-null  int64  
month          360752 non-null  int64  
return_1m_t-1  358914 non-null  float64
return_1m_t-2  357076 non-null  float64
return_1m_t-3  355238 non-null  float64
return_1m_t-4  353400 non-null  float64
return_1m_t-5  351562 non-null  float64
return_1m_t-6  349724 non-null  float64
target_1m      358914 non-null  float64
target_2m      357076 non-null  float64
target_3m      355238 non-null  float64
target_6m      349724 non-null  float64
target_12m     338696 non-null  float64
dtypes: float64(28), int64(2)
memory usage: 92.1+ MB

Create Age Proxy

Here, we use quintiles of the IPO year as a proxy for company age. This means dividing a set of IPOs into five groups, or quintiles, based on the year in wihch each company went public. Each quintile represents a subset of companies that went public during a specific range of years.

data = (data
        .join(pd.qcut(stocks.ipoyear, q=5, labels=list(range(1, 6)))
              .astype(float)
              .fillna(0)
              .astype(int)
              .to_frame('age')))
data.age = data.age.fillna(-1)

Create Dynamic Size Proxy

stocks.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2412 entries, A to ZUMZ
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   marketcap  2407 non-null   float64
 1   ipoyear    1065 non-null   float64
 2   sector     2372 non-null   object 
dtypes: float64(2), object(1)
memory usage: 139.9+ KB

size_factor = (monthly_prices
               .loc[data.index.get_level_values('date').unique(),
                    data.index.get_level_values('ticker').unique()]
               .sort_index(ascending=False)
               .pct_change()
               .fillna(0)
               .add(1)
               .cumprod())
size_factor.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 207 entries, 2018-03-31 to 2001-01-31
Columns: 1838 entries, A to ZUMZ
dtypes: float64(1838)
memory usage: 2.9 MB

Create Size Indicator as Declines per Period

msize = (size_factor
         .mul(stocks
              .loc[size_factor.columns, 'marketcap'])).dropna(axis=1, how='all')

data['msize'] = (msize
                 .apply(lambda x: pd.qcut(x, q=10, labels=list(range(1, 11)))
                        .astype(int), axis=1)
                 .stack()
                 .swaplevel())
data.msize = data.msize.fillna(-1)

Combine Data

data = data.join(stocks[['sector']])
data.sector = data.sector.fillna('Unknown')

data.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 360752 entries, ('A', Timestamp('2001-01-31 00:00:00', freq='M')) to ('ZUMZ', Timestamp('2018-03-31 00:00:00', freq='M'))
Data columns (total 33 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 return_1m      360752 non-null  float64
 return_2m      360752 non-null  float64
 return_3m      360752 non-null  float64
 return_6m      360752 non-null  float64
 return_9m      360752 non-null  float64
 return_12m     360752 non-null  float64
 Mkt-RF         360752 non-null  float64
 SMB            360752 non-null  float64
 HML            360752 non-null  float64
 RMW            360752 non-null  float64
CMA            360752 non-null  float64
momentum_2     360752 non-null  float64
momentum_3     360752 non-null  float64
momentum_6     360752 non-null  float64
momentum_9     360752 non-null  float64
momentum_12    360752 non-null  float64
momentum_3_12  360752 non-null  float64
year           360752 non-null  int64  
month          360752 non-null  int64  
return_1m_t-1  358914 non-null  float64
return_1m_t-2  357076 non-null  float64
return_1m_t-3  355238 non-null  float64
return_1m_t-4  353400 non-null  float64
return_1m_t-5  351562 non-null  float64
return_1m_t-6  349724 non-null  float64
target_1m      358914 non-null  float64
target_2m      357076 non-null  float64
target_3m      355238 non-null  float64
target_6m      349724 non-null  float64
target_12m     338696 non-null  float64
age            360752 non-null  int64  
msize          360752 non-null  float64
sector         360752 non-null  object 
dtypes: float64(29), int64(3), object(1)
memory usage: 100.4+ MB

Store Data

store the data into an HDF file for later use

with pd.HDFStore('engineered_features.h5') as store:
    store.put('engineered_features', data.sort_index().loc[idx[:, :datetime(2024, 2, 28)], :])
    print(store.info())

<class 'pandas.io.pytables.HDFStore'>
File path: engineered_features.h5
/engineered_features            frame        (shape->[360752,33])