Added dotplot - show all datapoints for a category #226

BioTurboNick · 2019-04-17T16:05:47Z

There's a trend in science toward showing every individual data point wherever something like a boxplot would otherwise be used. This dotplot is designed to let individual data points be overlaid onto a boxplot. Violin plots, while showing the overall shape of a distribution, may be deceptive for sparse points.

Currently, it uses a window around each point (based on quantiles) and randomization to spread the points out so that the point density stays the same across the plot. It may be extended later to other layouts (e.g. no-overlap).

```julia
using StatsPlots
x = mod.(rand(Int, 5000), 5) .+ 1
y = randn(5000)
dotplot(x, y, marker=(3, stroke(0)))
boxplot!(x, y, opacity=0.5)
```

A future addition could allow raincloud plots to be created: https://micahallen.org/2018/03/15/introducing-raincloud-plots/

BioTurboNick · 2019-04-17T18:28:06Z

Added a widthwindow argument that sets the size of the window over which density is assessed. The default is 0.1 of the distance between q1 and q5; the valid range is [0, 1]. A value of 1 creates a uniform-width scatter.
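
The windowed-density idea described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code from this PR: the helper name `density_jitter`, the `maxwidth` keyword, and the choice to measure the window against the full data range are all assumptions made for the example.

```julia
using Random

# Hypothetical helper (not the PR's implementation): horizontal offsets whose
# spread follows the local point density of a non-empty vector `y`.
function density_jitter(y::AbstractVector{<:Real}; widthwindow = 0.1,
                        maxwidth = 0.4, rng = Random.default_rng())
    0 <= widthwindow <= 1 || throw(ArgumentError("widthwindow must be in [0, 1]"))
    lo, hi = extrema(y)
    # Assumption: the window is a fraction of the full data range.
    halfwindow = widthwindow * (hi - lo)
    # Count the points within ±halfwindow of each point (the point itself included).
    counts = [count(v -> abs(v - yi) <= halfwindow, y) for yi in y]
    # Scale so that the densest region uses the full width.
    halfwidths = maxwidth .* counts ./ maximum(counts)
    # Draw a uniform random offset in [-halfwidth, halfwidth] for each point.
    return (2 .* rand(rng, length(y)) .- 1) .* halfwidths
end
```

With `widthwindow = 1` every point sees every other point, so all points get the same half-width, which matches the uniform-width behaviour described above. The neighbour count here is O(n²), which is fine for a sketch but would want a sorted sweep for large inputs.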

daschw

I really like how the plots look and I think they are a nice alternative to violin plots!
However, I'm not quite sure about the name. On the one hand, there's this: https://en.wikipedia.org/wiki/Dot_plot_(statistics), on the other hand, somehow it just sounds like another term for scatter plot. I don't have a better suggestion, though.
I'm also not sure if it is a problem that plots look different for the same data due to the random horizontal distribution of the points. Again, I can't think of a better implementation.
I'd rather add it to StatsPlots, than not having it.
Maybe @piever or @mkborregaard have a different opinion.
Another idea would be to add it as a style option for violin plots.

piever · 2019-04-18T10:17:06Z

> However, I'm not quite sure about the name.

I'm not an expert on the violin, boxplot & co. nomenclature, but isn't this very similar to the beeswarm plot from #61? Maybe it could also be called beeswarm. I wonder whether this could have a side attribute (pretty much like violin) to show points only on one side, so that users could combine it with a violin on the other side and achieve the same plots as in #61 (comment)

> I'm also not sure if it is a problem that plots look different for the same data due to the random horizontal distribution of the points. Again, I can't think of a better implementation.

I think ggplot2 in R also has some jitter option, maybe we could check how they do it (if they accept that the same data could give different plots, or fix some random seed or some other solution I can't think of).

BioTurboNick · 2019-04-18T11:57:49Z

Ahh, that's the name of that plot, beeswarm. I had no idea and so couldn't easily find it. Take that for what it's worth :-). Searching dotplot for images shows everything but the beeswarm layout.

#61 is definitely going beyond my immediate goals, but I do think the efforts should be combined. I'll see how I can help; perhaps a simpler version using jitter alone (seems to be the term in use) can be added as a step to the more complex one. Being able to specify (or retrieve?) a seed might be useful in case one finds a horizontal distribution they like...
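
One way the seed idea could work is to let the caller pass in a random number generator. The sketch below only illustrates that approach: `jitter` and its `width` and `rng` keywords are hypothetical names, not part of StatsPlots.

```julia
using Random

# Hypothetical helper: uniform horizontal jitter drawn from a caller-supplied RNG.
jitter(n; width = 0.3, rng = Random.default_rng()) = width .* (rand(rng, n) .- 0.5)

# The same seed reproduces the same offsets, so a layout one likes can be recreated.
a = jitter(5; rng = MersenneTwister(1234))
b = jitter(5; rng = MersenneTwister(1234))
a == b  # true
```

Passing the generator explicitly avoids touching the global random state, so a plot call would not change the random numbers seen by the rest of a script.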

(I'm not a huge fan of the original beeswarm because it can have weird spikes that IMO distract from the distribution, though it does benefit from showing all points unambiguously.)

mkborregaard · 2019-04-18T17:40:15Z

Thanks for this @BioTurboNick. So, I think this is a beeswarm plot + a boxplot. I think it would definitely be nice to combine this with #61 - the code I posted in a comment creates a beeswarm plot which I personally find superior to any other implementation I've seen. It does have some spikes, but that is the result of wanting to show the dots like this.

As for this, I think the idiomatic way to do this in StatsPlots would be

```julia
beeswarm(myx, myy)
boxplot!(myx, myy)
```

We could have a recipe that combines these - in that case IMHO it should be named for the components, e.g. like scatterlines, so beebox or something like that.

@BioTurboNick would you consider trying to merge your beeswarm implementation with the implementation I posted? Or do you actually prefer the jittered version?

BioTurboNick · 2019-04-18T20:35:46Z

@mkborregaard - Yeah, I'd be down to tackle that. It'll be a few weeks, trying to get a paper out and have a vacation.

BioTurboNick · 2019-06-03T09:00:33Z

I've looked at the Wilkinson paper on dot plots and I'm persuaded that "dot plot" is the proper, original, general term for this type of plot. Beeswarm is a particular recent variant. "Strip plot" is another synonym for a 1-d scatter. Perhaps these could be mapped to an underlying dotplot.

I added a non-displaced version and a more refined jittered version based on the violin plot. Setting the mode argument to :none (the default) gives the first, :densityjittered gives my original, and :violinjittered gives the violin-based version. This is mainly for experimental use.

I think the version based on violin is a bit better and I'll likely replace my original version.

As to a version that guarantees no overlap, I looked a bit into the code from #61 and it needs some upgrading to Julia 1+, and I'm not familiar with the old conventions. However, I think that could be something added later, while the jittered version may be completed now. Especially since to do it properly would require knowledge of marker size, which I understand will be coming to Plots in the future?

I also thought about the random seed issue, but I can't think of a good way to present it. However, with low numbers of points I don't believe there would be too much value in choosing a particular visual distribution. And if the distribution gets dense enough, using a violin plot, or the non-overlap version, might be a better choice.

So, my question is what do I need to do to complete a release-able version of the jittering code?

BioTurboNick · 2019-06-03T16:05:47Z

I had a question about the code at the top for when only y is provided. It's taken from boxplot. The comment says if only y is provided, then x will be UnitRange 1:length(y). This seems to be incorrect or incomplete.

The behavior of the code appears to be that if y is a 2-dimensional array, x will be UnitRange 1:size(y, 2); if y is a vector of vectors, x will be UnitRange 1:length(y); if y is a vector of numbers, x will be just 1. Can we adjust this comment to match?
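
The three cases described above can be written out as plain dispatch. This is only a sketch of the described behaviour, not the recipe's actual code; `default_x` is a hypothetical name.

```julia
# Hypothetical helper mirroring the behaviour described above; not the recipe's code.
default_x(y::AbstractMatrix) = 1:size(y, 2)                   # one category per column
default_x(y::AbstractVector{<:AbstractVector}) = 1:length(y)  # one category per inner vector
default_x(y::AbstractVector{<:Real}) = 1                      # a single category

default_x(rand(10, 3))         # 1:3
default_x([rand(4), rand(6)])  # 1:2
default_x(rand(10))            # 1
```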

daschw · 2019-06-03T18:13:47Z

> Can we adjust this comment to match?

Sure, feel free to add a better comment.

BioTurboNick · 2019-06-03T19:26:37Z

Here's a figure comparing the three modes, now specified by mode = :none, mode = :uniform, and mode = :density.

Thinking through the possible variants, we could have:

None
Uniform with jitter
Uniform with no overlap (not yet implemented)
Kernel density with jitter
Kernel density with no overlap (not yet implemented)
Beeswarm (from WIP: Beeswarm plot #61 proposal)

If these are all implemented in this one recipe, how should they be accessed?

Could do: :none, :uniform, :density, :beeswarm; uniform and density take additional argument of allowoverlap = true or jitter = true, depending on what would be best for "no overlap" (more uniform spread, or jittered but points are redrawn to avoid overlaps?).
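
As a rough illustration of the single-keyword option, the sketch below dispatches on a `mode` symbol. Everything in it is hypothetical: the function name `offsets`, the `width` and `rng` keywords, and the decision to fill in only the two modes that need no density estimate.

```julia
using Random

# Hypothetical sketch of a `mode` keyword; only :none and :uniform are filled in.
function offsets(y::AbstractVector{<:Real}; mode::Symbol,
                 width = 0.4, rng = Random.default_rng())
    if mode === :none
        # No displacement: every point stays on its category position.
        return zeros(length(y))
    elseif mode === :uniform
        # Uniform jitter in [-width, width], independent of the data.
        return width .* (2 .* rand(rng, length(y)) .- 1)
    elseif mode in (:density, :beeswarm)
        throw(ArgumentError("mode = :$mode is not implemented in this sketch"))
    else
        throw(ArgumentError("unknown mode :$mode; expected :none, :uniform, :density or :beeswarm"))
    end
end
```

An extra keyword such as allowoverlap would then only need to be checked inside the :uniform and :density branches.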

Thoughts?

mkborregaard · 2019-06-03T22:21:11Z

I like your thinking there

mkborregaard · 2019-06-03T22:21:47Z

(the conflict is the REQUIRE file - maybe just delete that and add the dep to the Project.toml instead?)

BioTurboNick · 2019-06-04T02:21:54Z

Guess they conflict either way.

mkborregaard · 2019-06-04T05:52:51Z

Yeah - do you know how to do a local rebase?

BioTurboNick · 2019-06-04T10:55:51Z

@mkborregaard - I think so, if I know what the goal is?

mkborregaard · 2019-06-04T11:17:38Z

The thing is that while this PR has been open, we've abandoned REQUIRE files and started using Project.toml files instead. But this PR points to an older commit on master, so when you add a project file it would overwrite the existing one.
So by doing a local rebase, you would address all conflicts manually and have the PR point towards the current master. It's easiest to show in an image:

mkborregaard · 2019-06-04T11:18:57Z

As you can see your groupspacingfix PR is also based on the old master, which is what makes the commit tree look a bit non-linear.

BioTurboNick · 2019-06-04T19:10:58Z

Ah, I see. I'll figure it out, thanks!

mkborregaard · 2019-06-04T19:42:56Z

I can warmly recommend GitKraken instead of fiddling around with the command line. It's just rebase -> resolve -> force push. It might be a good idea to have a backup branch pointing to this commit before attempting the rebase, in case something goes wrong.

BioTurboNick · 2019-06-04T21:25:36Z

Oh awesome tool, thanks!

mkborregaard · 2019-06-05T05:21:33Z

You don't have to commit the manifest file - just delete it. It's not necessary for packages.

daschw reviewed Apr 18, 2019

daschw requested changes Jun 3, 2019

BioTurboNick added 9 commits June 4, 2019 17:20

Added dotplot

660d1f6

Added widthwindow attribute

2ceabca

New test modes, including one based on violin

798de52

Fixed DataFrames issue

285c062

Removed no-overlap vestige for now

73af83b

Simplified code, renamed modes, made density default, add Interpolations dependency

480f05f

Adjusted comment

4913374

Updated comment, grouping logic

c1eda3f

Reverted requirements, added to toml

0ec797a

BioTurboNick added 4 commits June 4, 2019 18:08

Added documentation to README

21f34c2

Added updated graphics

34413d4

Added note about mode = :none

1ddf676

Trying to fix Manifest.toml...

b27d17a

Deleted Manifest.toml

3c845cf

daschw approved these changes Jul 8, 2019

daschw merged commit 7d3e0cc into JuliaPlots:master Jul 8, 2019
