matplotlib - Python Q-Q and P-P plot of two distributions of unequal length


I am not sure what the best/most statistically sound way to accomplish what I want is, but I am trying to take a distribution of p-values and compare it to a larger distribution of p-values created by permuting the original data. Because I am working with small p-values, I am comparing the log10 of the p-values.

I have been trying to figure out a general way to compare two arrays with similar values but unequal lengths. What I want is something like scipy.qqplot(dataset1, dataset2), but that doesn't exist; a Q-Q plot always compares a distribution to an established distribution (this question has been asked for R also: https://stats.stackexchange.com/questions/12392/how-to-compare-two-datasets-with-q-q-plot-using-ggplot2).

Essentially, this amounts to comparing two histograms. I can use np.linspace to force the exact same bins for each distribution:

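    import numpy as np

    bins = 100
    # Shared range across both vectors, so that the bins line up exactly
    mx = max(np.max(vector1), np.max(vector2))
    mn = min(np.min(vector1), np.min(vector2))
    boundaries = np.linspace(mn, mx, bins, endpoint=True)
    # Bin midpoints, used as labels
    labels = [(boundaries[i] + boundaries[i+1]) / 2 for i in range(len(boundaries) - 1)]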

I can then use these boundaries and labels to make two histograms, weighted by the length of the original vectors. The easiest thing would be to just use a few bins and plot them as histograms on the same axis, as is done in another question.
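A minimal sketch of the length-weighted histograms described here, assuming plain NumPy arrays as input. The function name `shared_bin_histograms` and its `n_bins` parameter are made up for illustration; dividing the counts by each sample's size is one way of doing the weighting, so that both histograms sum to 1 and can be compared even though the arrays have different lengths.

```python
import numpy as np


def shared_bin_histograms(a, b, n_bins=100):
    """Relative-frequency histograms of two samples on identical bins.

    Hypothetical helper: returns the bin centers and, for each sample,
    the fraction of its values that fall into each bin.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)

    # One set of edges spanning both samples, so the bins are identical
    edges = np.linspace(min(a.min(), b.min()), max(a.max(), b.max()), n_bins + 1)
    centers = (edges[:-1] + edges[1:]) / 2

    # Divide the counts by the sample size so that each histogram sums to 1
    freq_a = np.histogram(a, bins=edges)[0] / a.size
    freq_b = np.histogram(b, bins=edges)[0] / b.size
    return centers, freq_a, freq_b
```

The two frequency arrays can then be drawn against `centers` on the same axis, for example as step lines instead of filled bars.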

However, I want something more like a Q-Q plot, and I want to use a lot of bins, so that I can see even small deviations from the 1-to-1 line. The problem with plotting two histograms is this:

[Figure: histogram_example]

The two plots are right on top of each other, so I can't see anything.

So what I want to figure out is how to compare these two histograms while maintaining the bin labels. I can plot the two against each other as a scatter graph, but that ends up being indexed by bin frequency:

[Figure: definitely wrong]

What I want is to compare the two histograms, or to make a Q-Q plot of the differences, but I cannot come up with a statistically sound way of doing this. I can find no methods that allow me to make a Q-Q plot with two datasets instead of one dataset and a built-in distribution, and I can't find a way of plotting two distributions of unequal length against each other.

For reference, here are the two histograms that went into creating that plot; as you can see, they are extremely similar:

[Figure: histograms]

I know there must be a way of doing this, because it seems so obvious, but I am new to this kind of thing, and relatively new to scipy, pandas, and statsmodels also.

I intentionally have not provided an example distribution here, because I wasn't sure how to make a minimal set of arrays that were non-normally distributed and that captured what I am trying to do; plus, the point is to be able to do this with any two overlapping, unequal-length arrays.

What I want to know is: what is the right/best way to approach this problem in Python in a statistically sound way? Is there a way of creating a distribution from the permuted data that can be used with a statsmodels or scipy Q-Q plot? Is there a way to compare two histograms visually already? Is there a way of making probability plots that I don't know about?


Edit: trying cumulative and manual Q-Q plots

Thanks to @user333700's answer, I figured out how to create a manual Q-Q plot of the data, and also a cumulative probability plot. I created the plots using data with overlapping min/max from the following distributions:

[Figure: manufactured distributions]

Q-Q plot:

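    import numpy as np
    import matplotlib.pyplot as plt

    # Same percentiles (0..100) for both samples, whatever their lengths
    q = np.linspace(0, 100, 101)
    fig, ax = plt.subplots()
    ax.scatter(np.percentile(ytest, q), np.percentile(xtest, q))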

[Figure: qqplot]

So that works with simple data. The cumulative plot is similar:

    import pandas as pd
    import matplotlib.pyplot as plt

    bins = 100  # same value as in the histogram code earlier

    # Pick the bins
    x = ytest
    y = xtest
    boundaries = sorted(x)[::round(len(x)/bins)+1]
    labels = [(boundaries[i]+boundaries[i+1])/2 for i in range(len(boundaries)-1)]

    # Bin the two series into equal bins
    xb = pd.cut(x, bins=boundaries, labels=labels)
    yb = pd.cut(y, bins=boundaries, labels=labels)

    # Get the value counts for each bin and sort by bin
    xhist = xb.value_counts().sort_index(ascending=True)/len(xb)
    yhist = yb.value_counts().sort_index(ascending=True)/len(yb)

    # Make cumulative (running total over the sorted bins)
    xhist = xhist.cumsum()
    yhist = yhist.cumsum()

    # Plot
    fig, ax = plt.subplots(figsize=(6,6))
    ax.scatter(xhist, yhist)
    plt.show()

[Figure: cumulative plot]

Going on to the actual skewed data (where the two distributions are extremely similar in every way except for their lengths) and adding a 1-to-1 line, I get these two:

[Figure: plots with real data]

So both work, which is great. The cumulative probability plot shows quite well that there is no large difference in the data, but the Q-Q plot shows that there is a small difference in the tail.

In terms of statistical tests, scipy has a two-sample Kolmogorov-Smirnov test for continuous variables. The binned histogram data could be used with a chi-square test. scipy.stats also has a k-sample Anderson-Darling test.
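As a rough sketch of how these three tests could be called on two samples of unequal length: the helper name `compare_samples` and its `n_bins` parameter are made up for illustration, and for the binned data the sketch uses `scipy.stats.chi2_contingency` on a 2 x k table of counts (a chi-square test of homogeneity), which is one possible reading of "chi-square test" rather than the only one.

```python
import numpy as np
from scipy import stats


def compare_samples(a, b, n_bins=20):
    """Run three two-sample tests; the samples may differ in length.

    Hypothetical helper for illustration only.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)

    # Two-sample Kolmogorov-Smirnov test on the raw (continuous) values
    ks = stats.ks_2samp(a, b)

    # k-sample Anderson-Darling test (here k = 2); it may warn when the
    # p-value falls outside the range it can tabulate
    ad = stats.anderson_ksamp([a, b])

    # Chi-square test on counts from shared bins; fairly coarse bins are
    # used because the test is unreliable when expected counts are small
    edges = np.linspace(min(a.min(), b.min()), max(a.max(), b.max()), n_bins + 1)
    counts = np.vstack([
        np.histogram(a, bins=edges)[0],
        np.histogram(b, bins=edges)[0],
    ])
    counts = counts[:, counts.sum(axis=0) > 0]  # drop bins empty in both samples
    chi2_stat, chi2_p, _, _ = stats.chi2_contingency(counts)

    return {
        "ks_stat": ks.statistic,
        "ks_p": ks.pvalue,
        "ad_stat": ad.statistic,
        "chi2_stat": chi2_stat,
        "chi2_p": chi2_p,
    }
```

With large samples these tests will flag even tiny differences as significant, so they are probably best read alongside the plots rather than instead of them.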

For plotting:

The equivalent of a probability plot for two histograms would be to plot the cumulative frequencies of the two samples against each other, i.e. with the cumulative probabilities on each axis corresponding to the same bin boundaries.
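A minimal sketch of such a cumulative (P-P style) plot, assuming NumPy arrays and equally spaced shared bin boundaries. The names `pp_points` and `plot_pp` are made up for illustration; the cumulative probability at each boundary is taken as the fraction of each sample that is less than or equal to that boundary.

```python
import numpy as np


def pp_points(a, b, n_bins=100):
    """Cumulative probabilities of two samples at shared bin boundaries."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    edges = np.linspace(min(a[0], b[0]), max(a[-1], b[-1]), n_bins + 1)
    upper = edges[1:]  # upper boundary of each bin

    # Fraction of each sample that is <= each boundary
    cum_a = np.searchsorted(a, upper, side="right") / a.size
    cum_b = np.searchsorted(b, upper, side="right") / b.size
    return upper, cum_a, cum_b


def plot_pp(a, b, n_bins=100):
    """Scatter the two cumulative curves against each other."""
    # Imported here so that pp_points can be used without matplotlib
    import matplotlib.pyplot as plt

    _, cum_a, cum_b = pp_points(a, b, n_bins)
    fig, ax = plt.subplots(figsize=(6, 6))
    ax.scatter(cum_a, cum_b, s=10)
    ax.plot([0, 1], [0, 1], color="gray", linestyle="--")  # 1-to-1 line
    ax.set_xlabel("cumulative probability, sample 1")
    ax.set_ylabel("cumulative probability, sample 2")
    return fig, ax
```

Calling `plot_pp(vector1, vector2)` followed by `plt.show()` would draw the plot; points on the dashed line mean both samples have the same share of their values below that boundary.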

statsmodels has a Q-Q plot for two-sample comparison, but it assumes that the sample sizes are the same. If the sample sizes are different, then the quantiles need to be computed for the same probabilities. https://github.com/statsmodels/statsmodels/issues/2896 https://github.com/statsmodels/statsmodels/pull/3169 (I don't remember what the status of this is.)
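A minimal sketch of computing the quantiles at the same probabilities with plain NumPy, not the statsmodels implementation. The names `qq_points` and `plot_qq` are made up for illustration, and the choice of probabilities `(i - 0.5) / n`, with `n` the size of the smaller sample, is an assumption; any common grid of probabilities would serve.

```python
import numpy as np


def qq_points(a, b):
    """Quantiles of two samples evaluated at one shared set of probabilities."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)

    # One probability per observation of the smaller sample (assumed choice)
    n = min(a.size, b.size)
    probs = (np.arange(1, n + 1) - 0.5) / n

    # np.quantile interpolates, so the larger sample is reduced to n points
    return np.quantile(a, probs), np.quantile(b, probs)


def plot_qq(a, b):
    """Two-sample Q-Q scatter plot with a 1-to-1 reference line."""
    # Imported here so that qq_points can be used without matplotlib
    import matplotlib.pyplot as plt

    qa, qb = qq_points(a, b)
    fig, ax = plt.subplots(figsize=(6, 6))
    ax.scatter(qa, qb, s=10)
    lo, hi = min(qa[0], qb[0]), max(qa[-1], qb[-1])
    ax.plot([lo, hi], [lo, hi], color="gray", linestyle="--")  # 1-to-1 line
    ax.set_xlabel("quantiles of sample 1")
    ax.set_ylabel("quantiles of sample 2")
    return fig, ax
```

Because every point corresponds to one probability, deviations from the dashed line at the ends of the plot show differences in the tails, which is where the two distributions in the question differ.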

