
*: refactor #14

Merged 1 commit into main on Mar 23, 2023
Conversation

asubiotto (Member)

This commit refactors all querying logic into a new Querier struct. Since all queries are more or less dependent on each other, the execution logic is simplified to run the queries one at a time, with the option to add concurrent Queriers in the future if we want to increase load. This also exposes a single "query-interval" flag so the user can choose the interval at which queries are executed.

Other bits and bobs were also improved:

  • Termination has been simplified
  • Expiring maps have been removed. Although it was a good approach to store state needed by future queries, it is simpler and more memory efficient to just store the last result for each query. Since queries are now run in the same goroutine one after another, there is also no need for synchronization.
  • Queries are no longer run conditionally. For example, we used to avoid running 15-minute merges if QueryRange didn't return a series with enough data. Anecdotally, this seemed broken, and it was implemented in a leaky way: we would always run a QueryRange and record its latency as valid even if the data it executed on covered less than the expected range. Removing these conditions simplifies the querying logic (a rough sketch of the resulting structure follows below).
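
To make the structure concrete, here is a rough sketch of the sequential loop described above. The field and method names are illustrative rather than the exact ones in this change, and the flag default is only a placeholder:

package main

import (
    "context"
    "flag"
    "time"
)

// Querier is an illustrative stand-in for the struct introduced here: it keeps
// only the last result of each query (e.g. label names and profile types) so
// later queries can reuse them, and needs no synchronization because all
// queries run sequentially in a single goroutine.
type Querier struct {
    labels       []string
    profileTypes []string
}

func (q *Querier) queryLabels(ctx context.Context)       { /* fetch and store label names */ }
func (q *Querier) queryProfileTypes(ctx context.Context) { /* fetch and store profile types */ }
func (q *Querier) queryRange(ctx context.Context)        { /* QueryRange using the stored profile types */ }
func (q *Querier) queryMerge(ctx context.Context)        { /* merge queries using the stored profile types */ }

// Run executes all queries one after another on every tick of query-interval.
func (q *Querier) Run(ctx context.Context, interval time.Duration) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            q.queryLabels(ctx)
            q.queryProfileTypes(ctx)
            q.queryRange(ctx)
            q.queryMerge(ctx)
        }
    }
}

func main() {
    // The flag name matches this PR; the default value here is a placeholder.
    interval := flag.Duration("query-interval", 10*time.Second, "interval between query rounds")
    flag.Parse()
    (&Querier{}).Run(context.Background(), *interval)
}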

@asubiotto (Member Author)

Here is an example of the new logging messages:

2023/03/21 10:02:08 merge(type=memory:alloc_objects:count:space:bytes,reportType=REPORT_TYPE_PPROF,over=15m0s): took 782.871125ms
2023/03/21 10:02:09 merge(type=memory:alloc_objects:count:space:bytes,reportType=REPORT_TYPE_FLAMEGRAPH_UNSPECIFIED,over=15m0s): took 908.05025ms
2023/03/21 10:02:10 merge(type=memory:alloc_objects:count:space:bytes,reportType=REPORT_TYPE_FLAMEGRAPH_TABLE,over=15m0s): took 825.031083ms
2023/03/21 10:02:11 labels: took 1.130625ms and got 2 results
2023/03/21 10:02:11 values(label=job): took 940.083µs and got 2 results
2023/03/21 10:02:12 profile types: took 274.913083ms and got 7 types
2023/03/21 10:02:12 range(type=memory:alloc_objects:count:space:bytes,over=15m0s): took 441.7145ms and got 2 series
2023/03/21 10:02:13 range(type=memory:alloc_objects:count:space:bytes,over=168h0m0s): took 444.302ms and got 2 series
2023/03/21 10:02:13 single(series=memory:alloc_objects:count:space:bytes{instance="demo.parca.dev:443", job="demo"}, reportType=REPORT_TYPE_PPROF): took 391.051334ms
2023/03/21 10:02:13 single(series=memory:alloc_objects:count:space:bytes{instance="demo.parca.dev:443", job="demo"}, reportType=REPORT_TYPE_FLAMEGRAPH_UNSPECIFIED): took 489.131041ms
^C2023/03/21 10:02:13 querier: stopping
2023/03/21 10:02:13 single(series=memory:alloc_objects:count:space:bytes{instance="demo.parca.dev:443", job="demo"}, reportType=REPORT_TYPE_FLAMEGRAPH_TABLE): failed to make request: canceled: Post "http://localhost:7070/parca.query.v1alpha1.QueryService/Query": context canceled
2023/03/21 10:02:13 merge(type=memory:alloc_objects:count:space:bytes,reportType=REPORT_TYPE_PPROF,over=15m0s): failed to make request: canceled: context canceled
2023/03/21 10:02:13 querier: stopped
2023/03/21 10:02:13 terminated: received signal interrupt

This is also an improvement: previously we'd sometimes get errors with no indication of which query they came from.
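
Under the hood this is just timing each request and logging the query's parameters next to the result or error; roughly (the helper name here is illustrative, not the exact code):

package main

import (
    "log"
    "time"
)

// timed runs a single query and logs its duration or error together with a
// name that encodes the query's parameters, so failures can be attributed to
// the exact query that produced them.
func timed(name string, fn func() error) {
    start := time.Now()
    if err := fn(); err != nil {
        log.Printf("%s: failed to make request: %v", name, err)
        return
    }
    log.Printf("%s: took %v", name, time.Since(start))
}

func main() {
    timed("range(type=memory:alloc_objects:count:space:bytes,over=15m0s)", func() error {
        time.Sleep(time.Millisecond) // stand-in for a real request
        return nil
    })
}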

@asubiotto force-pushed the alfonso-refactor branch 2 times, most recently from c6f40d8 to aade067 on March 21, 2023 at 11:21

@asubiotto (Member Author)

Friendly ping

@metalmatze (Member) left a comment

The old version of parca-load first tried to figure out what data was available and then queried randomized data from a Parca instance.

This version of parca-load implements a different approach, and it definitely suits us better for now. 👍

Left a bunch of comments in the review.
Great work!

querier.go (outdated)
log.Println("values: no labels to query")
return
}
label := q.labels[len(q.labels)/2]

Member:
This will always query values for the same label name if the labels are stable. We should add some randomness here so that every run queries slightly different data in the system.

Member Author:
Agreed, I'm going to add an RNG. The only problem is that we could end up reporting metrics for different queries. In this case it doesn't matter, but what about merging CPU vs. memory profiles? I'm not sure those should belong to the same time series. Maybe we should do a query per value and expose those as different labels in the time series?
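
Roughly what I have in mind for the selection (a sketch only; the helper here is mine and not in this change):

package main

import (
    "fmt"
    "math/rand"
)

// pickRandom replaces the fixed "middle element" selection so that every run
// queries slightly different data.
func pickRandom(items []string) (string, bool) {
    if len(items) == 0 {
        return "", false
    }
    return items[rand.Intn(len(items))], true
}

func main() {
    labels := []string{"instance", "job"}
    if label, ok := pickRandom(labels); ok {
        fmt.Println("querying values for label:", label)
    }
}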

Member:
Yes, fair point. I think just randomness has served us well up until now, but we can always change that.

querier.go (resolved thread)
querier.go (outdated)
return
}

profileType := q.profileTypes[len(q.profileTypes)/2]

Member:
Same as above. We should make this configurable or randomized.

}
}

func (q *Querier) querySingle(ctx context.Context) {

Member:
I don't think we need querySingle anymore. It is still available in Parca and Polar Signals Cloud; however, the UI no longer uses that API. We should just remove it from parca-load.

Member Author:
How does the UI display a single profile or get a profile to diff? I think this is still a very important query to have, as it highlighted deficiencies in object storage filtering recently and I think we want to add it to our OKRs next quarter.

Member:
Instead of using this old API, the UI now issues a merge request where the start and end timestamps are the same. That way we only need one API going forward, but we can still request individual profiles for a given timestamp. The old querySingle was mostly a shim for that.
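
For reference, a rough sketch of what such a request looks like; the generated Go types, field names, and import path for parca.query.v1alpha1 are written from memory here and should be treated as assumptions rather than an exact reproduction:

package main

import (
    "time"

    // Import path as I recall it for the generated parca.query.v1alpha1 types.
    querypb "github.com/parca-dev/parca/gen/proto/go/parca/query/v1alpha1"
    "google.golang.org/protobuf/types/known/timestamppb"
)

// mergeAtInstant builds a merge request whose start and end are the same
// timestamp, which is how the UI now fetches an individual profile through
// the merge code path instead of the old single-profile API.
func mergeAtInstant(query string, at time.Time) *querypb.QueryRequest {
    ts := timestamppb.New(at)
    return &querypb.QueryRequest{
        Mode: querypb.QueryRequest_MODE_MERGE,
        Options: &querypb.QueryRequest_Merge{
            Merge: &querypb.MergeProfile{
                Query: query,
                Start: ts, // start == end: a "merge" over a single instant
                End:   ts,
            },
        },
        ReportType: querypb.QueryRequest_REPORT_TYPE_FLAMEGRAPH_TABLE,
    }
}

func main() {
    req := mergeAtInstant(`memory:alloc_objects:count:space:bytes{job="demo"}`, time.Now())
    _ = req // in parca-load this would be sent via the QueryService Query RPC
}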

Member Author:
👍🏼 I'll open a separate issue for this too

querier.go (outdated)
return
}

profileType := q.profileTypes[len(q.profileTypes)/2]

Member:
Here, the profileType should be the same profileType used in the previous QueryRange. Then we can rely on the start and end timestamps to figure out which profiles to query for the merge.

Member Author:
This is something that I would like to discuss with you. I don't think we need to rely on the timestamps returned by QueryRange.
We issue, for example, 15-minute query ranges on data that does not span the whole 15 minutes (say we've only collected 5 minutes of data). That query is considered successful and its latency is reported through the metrics. Why not treat merges the same way? It reduces complexity. And yes, the latency might be a lot better when we have less data, but I think that's fine; it should be easy to attribute to data size.
Additionally, I think it's better to issue a merge query that might not cover the full time range than to silently not run merge queries at all. If you look at Pyrra for our cloud project, for example, there is no consistency in how often merge queries are run.

querier.go (outdated, resolved)
sync/expire.go (outdated, resolved)

@asubiotto (Member Author)

Updated with the discussed changes

@asubiotto merged commit 045bca0 into main on Mar 23, 2023
@asubiotto deleted the alfonso-refactor branch on March 23, 2023 at 18:12