matplotlib - Python Q-Q and P-P plot of two distributions of unequal length
I'm not sure what the best/most statistically sound way to accomplish what I want is, but I am trying to take a distribution of p-values and compare it to a larger distribution of p-values created by permuting the original data. Because I am working with small p-values, I am actually comparing the log10 of the p-values.
I have been trying to figure out a general way to compare two arrays with similar values but unequal lengths. What I want is something like scipy.qqplot(dataset1, dataset2), but that doesn't exist; the Q-Q plot always compares a distribution to an established distribution (this question has been asked for R also: https://stats.stackexchange.com/questions/12392/how-to-compare-two-datasets-with-q-q-plot-using-ggplot2).
Essentially this amounts to comparing two histograms. I can use np.linspace to force the exact same bins for each distribution:
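
    import numpy as np

    bins = 100
    # Use the overall min and max of both vectors so the bins cover all the data
    mx = max(np.max(vector1), np.max(vector2))
    mn = min(np.min(vector1), np.min(vector2))
    boundaries = np.linspace(mn, mx, bins, endpoint=True)
    # Label each bin by its midpoint
    labels = [(boundaries[i] + boundaries[i+1]) / 2 for i in range(len(boundaries) - 1)]

I can then use these boundaries and labels to make two histograms, weighted by the length of the original vectors. The easiest thing to do is just to use a few bins and plot them as histograms on the same axis, as in this question: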
However, I want something more like a Q-Q plot, and I want to use a lot of bins, so that I can see even small deviations from the 1-to-1 line. The problem with plotting two histograms is that it looks like this:
The two plots are right on top of each other, so I can't see anything.
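The shared-bin setup described above can be sketched as follows. This is only an illustration under assumptions: `shared_bin_histograms` is a hypothetical helper name, the exponential samples are made-up stand-ins for the real p-value data, and `bins` here counts intervals rather than boundaries. Looking at the per-bin difference, instead of two overlaid histograms, is one possible way to make small deviations visible.

```python
import numpy as np

def shared_bin_histograms(a, b, bins=100):
    """Histogram two samples on identical bin edges, normalized to fractions."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Common edges spanning the overall min and max of both samples
    edges = np.linspace(min(a.min(), b.min()), max(a.max(), b.max()), bins + 1)
    # Weight every observation by 1/len so each histogram sums to 1,
    # which makes samples of different lengths comparable
    ha, _ = np.histogram(a, bins=edges, weights=np.full(a.size, 1.0 / a.size))
    hb, _ = np.histogram(b, bins=edges, weights=np.full(b.size, 1.0 / b.size))
    centers = (edges[:-1] + edges[1:]) / 2
    return centers, ha, hb

# Made-up stand-in data of unequal length (assumption, not the original data)
rng = np.random.default_rng(0)
vector1 = rng.exponential(size=300)
vector2 = rng.exponential(size=5000)

centers, h1, h2 = shared_bin_histograms(vector1, vector2, bins=50)
diff = h1 - h2  # per-bin difference in relative frequency
```

Plotting `diff` against `centers` keeps the bin labels on the x-axis, at the cost of being noisy when the smaller sample has few points per bin.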
So what I want to figure out is how to compare these two histograms while maintaining the bin labels. I can plot the two against each other as a scatter graph, but it ends up being indexed by bin frequency:
What I want is to compare the two histograms, or make a Q-Q plot of the differences, but I cannot come up with a statistically sound way of doing this. I can find no methods that allow me to make a Q-Q plot with two datasets instead of one dataset and a built-in distribution, and I can't find a way of plotting two distributions of unequal length against each other.
For reference, here are the two histograms that went into creating the plot; as you can see, they are extremely similar:
I know there must be a way of doing this, because it seems so obvious, but I am new to this kind of thing, and relatively new to scipy, pandas, and statsmodels also.
I intentionally have not provided an example distribution here, because I wasn't sure how to make a minimal set of arrays that were non-normally distributed and captured what I am trying to do; plus, the point is to be able to do this with any two overlapping unequal-length arrays.
What I want to know is: what is the right/best way to approach this problem in Python in a statistically sound way? Is there a way of creating a distribution from permuted data that can be used by the statsmodels or scipy Q-Q plot? Is there a way to compare two histograms visually that exists already? Is there a way of making probability plots that I don't know about?
Edit: trying cumulative and manual Q-Q plots
Thanks to @user333700's answer, I figured out how to create a manual Q-Q plot of this data, and also a cumulative probability plot. I created the plots first using data with overlapping min/max from the following distributions:
Q-Q plot:

    # Compare the two samples at the same 101 percentiles (0, 1, ..., 100)
    q = np.linspace(0, 100, 101)
    fig, ax = plt.subplots()
    ax.scatter(np.percentile(ytest, q), np.percentile(xtest, q))

So that works with simple data. The cumulative plot is similar:

    # Pick bins
    x = ytest
    y = xtest
    boundaries = sorted(x)[::round(len(x)/bins)+1]
    labels = [(boundaries[i]+boundaries[i+1])/2 for i in range(len(boundaries)-1)]

    # Bin the two series into equal bins
    xb = pd.cut(x, bins=boundaries, labels=labels)
    yb = pd.cut(y, bins=boundaries, labels=labels)

    # Get the value counts for each bin and sort by bin
    xhist = xb.value_counts().sort_index(ascending=True)/len(xb)
    yhist = yb.value_counts().sort_index(ascending=True)/len(yb)

    # Make cumulative (cumsum replaces the manual loop over Series.iteritems,
    # which was removed in pandas 2.0)
    xhist = xhist.cumsum()
    yhist = yhist.cumsum()

    # Plot
    fig, ax = plt.subplots(figsize=(6,6))
    ax.scatter(xhist, yhist)
    plt.show()

Going to my actual skewed data (where the two distributions are extremely similar in every way except their lengths) and adding a 1-to-1 line, I get these two:
So both work, which is great, and the cumulative probability plot shows quite clearly that there is no large difference in the data, but the Q-Q plot shows that there is a small difference in the tail.
In terms of statistical tests, scipy has a two-sample Kolmogorov-Smirnov test for continuous variables. The binned histogram data could be used with a chisquare test. scipy.stats also has a k-sample Anderson-Darling test.
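A rough sketch of two of these tests on samples of unequal length, using `scipy.stats.ks_2samp` on the raw values and a chi-square test on binned counts. Several things here are assumptions rather than part of the answer: the exponential samples are made-up stand-ins, the binned test is done with `scipy.stats.chi2_contingency` (a test of homogeneity on the 2 x k table of counts) as one possible reading of "chisquare test", and the ten quantile-based bins are an arbitrary choice made so that no bin is empty.

```python
import numpy as np
from scipy import stats

# Made-up stand-in samples of unequal length (assumption)
rng = np.random.default_rng(2)
observed = rng.exponential(size=300)
permuted = rng.exponential(size=6000)

# Two-sample Kolmogorov-Smirnov test on the raw (continuous) values
ks = stats.ks_2samp(observed, permuted)

# Chi-square test on binned counts: bin both samples on shared edges taken
# from the pooled deciles, so every bin contains some observations
pooled = np.concatenate([observed, permuted])
edges = np.quantile(pooled, np.linspace(0, 1, 11))
counts_obs, _ = np.histogram(observed, bins=edges)
counts_perm, _ = np.histogram(permuted, bins=edges)
table = np.vstack([counts_obs, counts_perm])
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)

print(f"KS statistic={ks.statistic:.4f}, p-value={ks.pvalue:.4f}")
print(f"chi-square={chi2:.4f}, dof={dof}, p-value={p_chi2:.4f}")
```

Note that the chi-square test uses raw counts, not the length-normalized frequencies, because the test itself accounts for the different sample sizes.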
For plotting:
The equivalent of a probability plot for two histograms would be to plot the cumulative frequencies for the two samples, i.e. with the cumulative probabilities on each axis corresponding to the bin boundaries.
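One way this cumulative-frequency (P-P) comparison might look in code is sketched below. It evaluates the empirical CDF of each sample on a shared grid of boundaries instead of accumulating histogram counts, which is a swapped-in but equivalent formulation. The name `pp_points`, the grid size and the stand-in samples are all assumptions.

```python
import numpy as np

def pp_points(sample_a, sample_b, n_points=200):
    """Empirical CDFs of two samples evaluated at the same boundaries."""
    a = np.sort(np.asarray(sample_a, dtype=float))
    b = np.sort(np.asarray(sample_b, dtype=float))
    # Shared boundaries spanning both samples
    grid = np.linspace(min(a[0], b[0]), max(a[-1], b[-1]), n_points)
    # Fraction of observations <= each boundary
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return grid, cdf_a, cdf_b

# Made-up stand-in samples of unequal length (assumption)
rng = np.random.default_rng(3)
observed = rng.exponential(size=300)
permuted = rng.exponential(size=6000)

grid, cdf_obs, cdf_perm = pp_points(observed, permuted)
max_gap = np.max(np.abs(cdf_obs - cdf_perm))  # largest vertical gap on the grid
```

Scattering `cdf_obs` against `cdf_perm` with a 1-to-1 line gives the P-P plot. `max_gap` is roughly the quantity that the two-sample Kolmogorov-Smirnov statistic measures, here restricted to the grid points.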
statsmodels has a Q-Q plot for two-sample comparison, but it assumes that the sample sizes are the same. If the sample sizes are different, then the quantiles need to be computed for the same probabilities. https://github.com/statsmodels/statsmodels/issues/2896 https://github.com/statsmodels/statsmodels/pull/3169 (I don't remember what the status of this is.)
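Computing the quantiles of both samples at the same probabilities can be sketched with plain NumPy, as below. This is a minimal illustration, not the statsmodels implementation; `qq_points`, the choice of 101 probabilities and the log10-uniform stand-in data are assumptions.

```python
import numpy as np

def qq_points(sample_a, sample_b, n_quantiles=101):
    """Quantiles of two samples of any lengths at the same probabilities."""
    probs = np.linspace(0, 1, n_quantiles)
    return np.quantile(sample_a, probs), np.quantile(sample_b, probs)

# Made-up stand-ins for log10 p-values of unequal length (assumption);
# the lower bound avoids taking log10 of zero
rng = np.random.default_rng(1)
observed = np.log10(rng.uniform(1e-12, 1.0, size=200))
permuted = np.log10(rng.uniform(1e-12, 1.0, size=20000))

q_obs, q_perm = qq_points(observed, permuted)
```

The two returned arrays always have the same length, so they can be scattered against each other with a 1-to-1 line regardless of the original sample sizes. With a small sample, the extreme quantiles rest on very few observations, so deviations in the tails should be read with care.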






