numpy - Matrix completion in Python
Say I have a matrix:
>>> import numpy as np
>>> a = np.random.random((5,5))
>>> a
array([[ 0.28164485,  0.76200749,  0.59324211,  0.15201506,  0.74084168],
       [ 0.83572213,  0.63735993,  0.28039542,  0.19191284,  0.48419414],
       [ 0.99967476,  0.8029097 ,  0.53140614,  0.24026153,  0.94805153],
       [ 0.92478   ,  0.43488547,  0.76320656,  0.39969956,  0.46490674],
       [ 0.83315135,  0.94781119,  0.80455425,  0.46291229,  0.70498372]])

and I punch holes in it with np.nan, e.g.:
>>> a[(1,4,0,3),(2,4,2,0)] = np.nan
>>> a
array([[ 0.80327707,  0.87722234,         nan,  0.94463778,  0.78089194],
       [ 0.90584284,  0.18348667,         nan,  0.82401826,  0.42947815],
       [ 0.05913957,  0.15512961,  0.08328608,  0.97636309,  0.84573433],
       [        nan,  0.30120861,  0.46829231,  0.52358888,  0.89510461],
       [ 0.19877877,  0.99423591,  0.17236892,  0.88059185,         nan]])

I would like to fill in the nan entries using information from the rest of the entries of the matrix. An example would be using the average value of the column in which each nan entry occurs.
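To make the column-average idea concrete, here is a minimal sketch in plain NumPy. The helper name `fill_with_column_means` is made up for this example, and it assumes every column has at least one non-nan entry (otherwise `np.nanmean` returns nan for that column).

```python
import numpy as np


def fill_with_column_means(matrix):
    """Return a copy of `matrix` with each nan replaced by its column's mean.

    Hypothetical helper for illustration; assumes no column is entirely nan.
    """
    filled = np.array(matrix, dtype=float)      # work on a copy
    col_means = np.nanmean(filled, axis=0)      # per-column mean, ignoring nan
    rows, cols = np.where(np.isnan(filled))     # positions of the holes
    filled[rows, cols] = col_means[cols]        # each hole gets its column mean
    return filled


rng = np.random.default_rng(0)
a = rng.random((5, 5))
a[(1, 4, 0, 3), (2, 4, 2, 0)] = np.nan          # same holes as in the question

filled = fill_with_column_means(a)
```

The observed entries are left untouched; only the four holes change.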
More generally, are there any libraries in Python for matrix completion? (e.g. something along the lines of Candes & Recht's convex optimization method).
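To illustrate what a low-rank completion of this kind might look like, here is a rough NumPy sketch of Soft-Impute, the nuclear-norm-regularised iteration described by Mazumder, Hastie and Tibshirani. It is related in spirit to the convex approach mentioned above but is not Candes & Recht's exact method, and it is not any library's API. The function name `soft_impute`, its parameters, the decreasing shrinkage schedule, and the rank-1 test matrix are all assumptions made for this example.

```python
import numpy as np


def soft_impute(matrix, n_shrinkages=30, final_shrinkage=1e-2,
                n_iters=100, tol=1e-7):
    """Fill nan entries by repeated SVD with soft-thresholded singular values.

    Illustrative sketch only: assumes at least one observed, non-zero entry.
    """
    observed = ~np.isnan(matrix)
    data = np.where(observed, matrix, 0.0)       # zeros stand in for the holes
    estimate = np.zeros_like(data)               # current low-rank estimate
    top = np.linalg.svd(data, compute_uv=False)[0]

    # Warm starts: begin with heavy shrinkage and relax it step by step,
    # reusing the previous estimate each time.
    for shrinkage in np.geomspace(top / 2, final_shrinkage, n_shrinkages):
        for _ in range(n_iters):
            # Keep observed entries, take the holes from the current estimate.
            filled = np.where(observed, data, estimate)
            u, s, vt = np.linalg.svd(filled, full_matrices=False)
            # Soft-threshold the singular values and rebuild the matrix.
            new_estimate = (u * np.maximum(s - shrinkage, 0.0)) @ vt
            step = np.linalg.norm(new_estimate - estimate)
            estimate = new_estimate
            if step <= tol * max(np.linalg.norm(estimate), 1.0):
                break

    # Observed entries are returned exactly as given.
    return np.where(observed, data, estimate)


rng = np.random.default_rng(0)
truth = np.outer(rng.uniform(1, 2, 20), rng.uniform(1, 2, 20))  # exactly rank 1
holes = rng.random(truth.shape) < 0.2                            # hide ~20% of entries
a = truth.copy()
a[holes] = np.nan

completed = soft_impute(a)
error = np.abs(completed - truth)[holes].mean()  # mean absolute error on the holes
```

Unlike the column-average fill, this uses the whole matrix to estimate each hole, but it only helps when the underlying matrix really is close to low rank, which a matrix of independent random numbers is not.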
Background:
This problem appears often in machine learning, for example when working with missing features in classification/regression, or in collaborative filtering (e.g. see the Netflix problem on Wikipedia).
If you install the latest scikit-learn, version 0.14a1, you can use its shiny new Imputer class:
>>> from sklearn.preprocessing import Imputer
>>> imp = Imputer(strategy="mean")
>>> a = np.random.random((5,5))
>>> a[(1,4,0,3),(2,4,2,0)] = np.nan
>>> a
array([[ 0.77473361,  0.62987193,         nan,  0.11367791,  0.17633671],
       [ 0.68555944,  0.54680378,         nan,  0.64186838,  0.15563309],
       [ 0.37784422,  0.59678177,  0.08103329,  0.60760487,  0.65288022],
       [        nan,  0.54097945,  0.30680838,  0.82303869,  0.22784574],
       [ 0.21223024,  0.06426663,  0.34254093,  0.22115931,         nan]])
>>> a = imp.fit_transform(a)
>>> a
array([[ 0.77473361,  0.62987193,  0.24346087,  0.11367791,  0.17633671],
       [ 0.68555944,  0.54680378,  0.24346087,  0.64186838,  0.15563309],
       [ 0.37784422,  0.59678177,  0.08103329,  0.60760487,  0.65288022],
       [ 0.51259188,  0.54097945,  0.30680838,  0.82303869,  0.22784574],
       [ 0.21223024,  0.06426663,  0.34254093,  0.22115931,  0.30317394]])

After this, you can use imp.transform to apply the same transformation to other data, using the column means that imp learned from a. Imputers tie into scikit-learn Pipeline objects, so you can use them in classification or regression pipelines.
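In current scikit-learn releases the Imputer class is no longer available under sklearn.preprocessing; its replacement is SimpleImputer in sklearn.impute. Below is a small sketch of the same workflow with that class, including reuse of the learned means on other data and a pipeline. The `other` array, the random targets `y`, and the choice of LinearRegression are made up purely for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
a = rng.random((5, 5))
a[(1, 4, 0, 3), (2, 4, 2, 0)] = np.nan

imp = SimpleImputer(strategy="mean")
b = imp.fit_transform(a)              # learns column means from a and fills the holes

# Apply the means learned from `a` to other data (illustrative array).
other = rng.random((3, 5))
other[0, 1] = np.nan
other_filled = imp.transform(other)   # the hole gets a's column-1 mean, not other's

# Imputation as the first step of a regression pipeline (made-up targets).
y = rng.random(5)
model = make_pipeline(SimpleImputer(strategy="mean"), LinearRegression())
model.fit(a, y)
predictions = model.predict(other)    # nan in `other` is imputed before predicting
```

The learned means are stored in `imp.statistics_`, which is handy for checking what will be filled in.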
If you want to wait for a stable release, 0.14 should be out next week.
Full disclosure: I'm a scikit-learn core developer.