Added dotplot - show all datapoints for a category #226

Merged
merged 14 commits into from
Jul 8, 2019

Conversation

BioTurboNick
Member

@BioTurboNick BioTurboNick commented Apr 17, 2019

There's a trend in science to show all individual datapoints when something like a boxplot would be used. This dotplot is designed to allow individual data points to be overlaid onto a boxplot. Violin plots, while showing the overall shape of a distribution, may be deceptive for sparse points.

Currently, it uses a quantile-based window around each point, plus randomization, to spread the points out so that the point density stays the same across the plot. It may be extended later to other layouts (e.g. no-overlap).

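using StatsPlots
x = rand(1:5, 5000)                    # five categories, labeled 1 to 5
y = randn(5000)                        # one normally distributed value per point
dotplot(x, y, marker=(3, stroke(0)))   # small markers without outlines
boxplot!(x, y, opacity=0.5)            # overlay a semi-transparent boxplot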

[Image: dotplot1]

A future addition could allow raincloud plots to be created: https://micahallen.org/2018/03/15/introducing-raincloud-plots/

@BioTurboNick
Member Author

Added a widthwindow argument that allows the area over which density is assessed to change. Default is 0.1 of the distance between q1 and q5, valid range [0,1]. Value of 1 creates a uniform width scatter.
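
As a rough illustration of the idea (not the code in this PR), the sketch below scales each point's random horizontal offset by the number of neighbours found within a window of `widthwindow` times the data range on either side. The helper name `density_jitter`, the `maxwidth` and `rng` keywords, and the reading of "q1 and q5" as the minimum and maximum of the data are all assumptions made for the example.

```julia
using Random

# Hypothetical helper, not the implementation merged in this PR.
# Returns one horizontal offset per value in `y`; points in dense regions
# are allowed to spread wider than points in sparse regions.
function density_jitter(y::AbstractVector{<:Real}; widthwindow::Real = 0.1,
                        maxwidth::Real = 0.8,
                        rng::AbstractRNG = Random.default_rng())
    0 <= widthwindow <= 1 || throw(ArgumentError("widthwindow must be in [0, 1]"))
    lo, hi = extrema(y)
    # Assumption: the window reaches widthwindow * range on each side of a point.
    halfwindow = widthwindow * (hi - lo)
    # Local density estimate: number of points inside each point's window.
    counts = [count(v -> abs(v - yi) <= halfwindow, y) for yi in y]
    maxcount = maximum(counts)
    # Random offset, scaled by the local density relative to the densest window.
    return [(rand(rng) - 0.5) * maxwidth * c / maxcount for c in counts]
end

offsets = density_jitter(randn(500); widthwindow = 0.1)
```

With `widthwindow = 1` every window covers all points, so every point gets the same spread width, which matches the uniform-width scatter described above.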

Member

@daschw daschw left a comment

I really like how the plots look and I think they are a nice alternative to violin plots!
However, I'm not quite sure about the name. On the one hand, there's this: https://en.wikipedia.org/wiki/Dot_plot_(statistics). On the other hand, it somehow just sounds like another term for scatter plot. I don't have a better suggestion, though.
I'm also not sure if it is a problem that plots look different for the same data due to the random horizontal distribution of the points. Again, I can't think of a better implementation.
I'd rather add it to StatsPlots than not have it.
Maybe @piever or @mkborregaard have a different opinion.
Another idea would be to add it as a style option for violin plots.

@piever
Member

piever commented Apr 18, 2019

However, I'm not quite sure about the name.

I'm not an expert on the violin, boxplot & co. nomenclature, but isn't this very similar to the beeswarm plot from #61? Maybe it could also be called beeswarm. I wonder whether this could have a side attribute (pretty much like violin) to show points only on one side, so that users could combine it with a violin on the other side and achieve the same plots as in #61 (comment)

I'm also not sure if it is a problem that plots look different for the same data due to the random horizontal distribution of the points. Again, I can't think of a better implementation.

I think ggplot2 in R also has some jitter option, maybe we could check how they do it (if they accept that the same data could give different plots, or fix some random seed or some other solution I can't think of).

@BioTurboNick
Member Author

BioTurboNick commented Apr 18, 2019

Ahh, that's the name of that plot, beeswarm. I had no idea and so couldn't easily find it. Take that for what it's worth :-). Searching dotplot for images shows everything but the beeswarm layout.

#61 is definitely going beyond my immediate goals, but I do think the efforts should be combined. I'll see how I can help; perhaps a simpler version using jitter alone (seems to be the term in use) can be added as a step to the more complex one. Being able to specify (or retrieve?) a seed might be useful in case one finds a horizontal distribution they like...
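
One way a seed could be exposed, sketched here under the assumption that the recipe accepted a random number generator: drawing the jitter from an explicitly seeded `MersenneTwister` makes the horizontal layout repeatable. The function `seeded_jitter` and its `width` keyword are hypothetical and not part of StatsPlots.

```julia
using Random

# Hypothetical helper: uniform horizontal offsets within [-width/2, width/2],
# drawn from a caller-supplied generator so the layout can be reproduced.
seeded_jitter(rng::AbstractRNG, n::Integer; width::Real = 0.8) =
    (rand(rng, n) .- 0.5) .* width

a = seeded_jitter(MersenneTwister(1234), 5)
b = seeded_jitter(MersenneTwister(1234), 5)
a == b   # true: the same seed reproduces the same offsets
```

For code that draws from the global generator instead, calling `Random.seed!(1234)` immediately before plotting should have a similar effect.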

(I'm not a huge fan of the original beeswarm because it can have weird spikes that IMO distract from the distribution, but at the same time benefits from showing all points unambiguously).

@mkborregaard
Member

Thanks for this, @BioTurboNick. So, I think this is a beeswarm plot + a boxplot. I think it would definitely be nice to combine this with #61 - the code I posted in a comment creates a beeswarm plot which I personally find superior to any other implementation I've seen. It does have some spikes, but that is a result of wanting to show the dots like this.

As for this, I think the idiomatic way to do this in StatsPlots would be

beeswarm(myx, myy)
boxplot!(myx, myy)

We could have a recipe that combines these - in that case IMHO it should be named for the components, eg like scatterlines, so beebox or sth like that.

@BioTurboNick would you consider trying to merge your beeswarm implementation with the implementation I posted? Or do you actually prefer the jittered version?

@BioTurboNick
Member Author

@mkborregaard - Yeah, I'd be down to tackle that. It'll be a few weeks, trying to get a paper out and have a vacation.

@BioTurboNick
Member Author

BioTurboNick commented Jun 3, 2019

I've looked at the Wilkinson paper on dot plots and I'm persuaded that "dot plot" is the proper, original, general term for this type of plot. Beeswarm is a particular recent variant. "Strip plot" is another synonym for a 1-d scatter. Perhaps these could be mapped to an underlying dotplot.

I added a non-displaced version and a more refined jittered version based on the violin plot. Setting the mode argument to :none (default) gives the first, :densityjittered gives my original, and :violinjittered gives the violin version. This is mainly for experimental use.

I think the version based on violin is a bit better and I'll likely replace my original version.

As to a version that guarantees no overlap, I looked a bit into the code from #61 and it needs some upgrading to Julia 1+, and I'm not familiar with the old conventions. However, I think that could be something added later, while the jittered version may be completed now. Especially since to do it properly would require knowledge of marker size, which I understand will be coming to Plots in the future?

I also thought about the random seed issue, but I can't think of a good way to present it. However, with low numbers of points I don't believe there would be too much value in choosing a particular visual distribution. And if the distribution gets dense enough, using a violin plot, or the non-overlap version, might be a better choice.

So, my question is what do I need to do to complete a release-able version of the jittering code?

@BioTurboNick
Member Author

I had a question about the code at the top for when only y is provided. It's taken from boxplot. The comment says that if only y is provided, then x will be the UnitRange 1:length(y). This seems to be incorrect or incomplete.

The behavior of the code appears to be that if y is a 2-dimensional array, x will be the UnitRange 1:size(y, 2); if y is a vector of vectors, x will be the UnitRange 1:length(y); if y is a vector of numbers, x will be just 1. Can we adjust this comment to match?
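
The three cases described above can be written out as a small sketch. The function `default_x` is hypothetical and only mirrors the observed behavior; it is not the code used in the boxplot or dotplot recipe.

```julia
# Hypothetical illustration of the default x when only y is supplied.
default_x(y::AbstractMatrix) = 1:size(y, 2)                   # one category per column
default_x(y::AbstractVector{<:AbstractVector}) = 1:length(y)  # one category per inner vector
default_x(y::AbstractVector{<:Real}) = 1                      # a single category

default_x(rand(10, 3))          # 1:3
default_x([rand(4), rand(5)])   # 1:2
default_x(rand(7))              # 1
```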

@daschw
Member

daschw commented Jun 3, 2019

Can we adjust this comment to match?

Sure, feel free to add a better comment.

@BioTurboNick
Member Author

BioTurboNick commented Jun 3, 2019

Here's a figure comparing the three modes, now specified by mode = :none, mode = :uniform, and mode = :density.

[Image: compare]

Thinking through the possible variants, we could have:

  • None
  • Uniform with jitter
  • Uniform with no overlap (not yet implemented)
  • Kernel density with jitter
  • Kernel density with no overlap (not yet implemented)
  • Beeswarm (from WIP: Beeswarm plot #61 proposal)

If these are all implemented in this one recipe, how should they be accessed?

Could do: :none, :uniform, :density, :beeswarm; uniform and density take additional argument of allowoverlap = true or jitter = true, depending on what would be best for "no overlap" (more uniform spread, or jittered but points are redrawn to avoid overlaps?).
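
A minimal sketch of how such a `mode` keyword might be dispatched, assuming the symbol names proposed above. The function `mode_offsets` and the `width` keyword are made up for illustration, and only the two simplest modes are filled in.

```julia
using Random

# Mode names taken from the proposal above.
const PROPOSED_MODES = (:none, :uniform, :density, :beeswarm)

# Hypothetical helper: horizontal offsets for the points in `y` under a given mode.
function mode_offsets(y::AbstractVector{<:Real}, mode::Symbol = :none;
                      width::Real = 0.8,
                      rng::AbstractRNG = Random.default_rng())
    mode in PROPOSED_MODES ||
        throw(ArgumentError("mode must be one of $(PROPOSED_MODES), got :$mode"))
    if mode === :none
        return zeros(length(y))                         # points stay on the category line
    elseif mode === :uniform
        return (rand(rng, length(y)) .- 0.5) .* width   # same spread width everywhere
    else
        error("mode :$mode is not covered by this sketch")
    end
end

mode_offsets(randn(100), :uniform)
```

Keeping the accepted names in one tuple means an unknown symbol fails early with a clear message, whichever additional keyword (`allowoverlap` or `jitter`) is chosen later.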

Thoughts?

@mkborregaard
Member

I like your thinking there

@mkborregaard
Member

(The conflict is the REQUIRE file - maybe just delete that and add the dep to the Project.toml instead?)

@BioTurboNick
Member Author

Guess they conflict either way.

@mkborregaard
Member

Yeah - do you know how to do a local rebase?

@BioTurboNick
Member Author

@mkborregaard - I think so, if I know what the goal is?

@mkborregaard
Member

The thing is that while this PR has been open, we've abandoned REQUIRE files and started using Project.toml files instead. But this PR points to an older commit on master, so when you add a project file it would overwrite the existing one.
So by doing a local rebase, you would address all conflicts manually and have the PR point towards the current master. It's easiest to show in an image:
[Image: screenshot, 2019-06-04 13:16:40]

@mkborregaard
Member

mkborregaard commented Jun 4, 2019

As you can see your groupspacingfix PR is also based on the old master, which is what makes the commit tree look a bit non-linear.

@BioTurboNick
Member Author

Ah, I see. I'll figure it out, thanks!

@mkborregaard
Member

mkborregaard commented Jun 4, 2019

I can warmly recommend gitkraken instead of fiddling around with the command line. It's just rebase->resolve->force push. Might be a good idea to have a backup branch pointing to this commit before attempting the rebase, in case something goes wrong.

@BioTurboNick
Member Author

Oh awesome tool, thanks!

@mkborregaard
Member

You don't have to commit the manifest file - just delete it. It's not necessary for packages.

@daschw daschw merged commit 7d3e0cc into JuliaPlots:master Jul 8, 2019
4 participants