
*: refactor #14

Merged 1 commit into main on Mar 23, 2023
Conversation

asubiotto (Member)

This commit refactors all querying logic into a new Querier struct. Since all queries are more or less dependent on each other, the execution logic is simplified to run the queries one at a time, with the option to add concurrent Queriers in the future if we want to increase load. This also exposes a single "query-interval" flag so the user can choose the interval at which queries are executed.

Other bits and bobs were also improved:

  • Termination has been simplified
  • Expiring maps have been removed. Although it was a good approach to store state needed by future queries, it is simpler and more memory efficient to just store the last result for each query. Since queries are now run in the same goroutine one after another, there is also no need for synchronization.
  • Queries are no longer run conditionally. For example, we used to avoid running 15-minute merges if QueryRange didn't return a series with enough data. Anecdotally, this seemed broken, and it was implemented in a leaky way: we would always run a QueryRange and record its latency as valid even if the data it executed on covered less than the expected range. Removing these conditions simplifies the querying logic (a rough sketch of the resulting structure follows below).
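
To make the structure concrete, here is a rough sketch of the sequential loop described above. The field and method names are illustrative rather than the exact ones in this change, and the flag default is only a placeholder:

package main

import (
    "context"
    "flag"
    "time"
)

// Querier is an illustrative stand-in for the struct introduced here: it keeps
// only the last result of each query (e.g. label names and profile types) so
// later queries can reuse them, and needs no synchronization because all
// queries run sequentially in a single goroutine.
type Querier struct {
    labels       []string
    profileTypes []string
}

func (q *Querier) queryLabels(ctx context.Context)       { /* fetch and store label names */ }
func (q *Querier) queryProfileTypes(ctx context.Context) { /* fetch and store profile types */ }
func (q *Querier) queryRange(ctx context.Context)        { /* QueryRange using the stored profile types */ }
func (q *Querier) queryMerge(ctx context.Context)        { /* merge queries using the stored profile types */ }

// Run executes all queries one after another on every tick of query-interval.
func (q *Querier) Run(ctx context.Context, interval time.Duration) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            q.queryLabels(ctx)
            q.queryProfileTypes(ctx)
            q.queryRange(ctx)
            q.queryMerge(ctx)
        }
    }
}

func main() {
    // The flag name matches this PR; the default value here is a placeholder.
    interval := flag.Duration("query-interval", 10*time.Second, "interval between query rounds")
    flag.Parse()
    (&Querier{}).Run(context.Background(), *interval)
}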

@asubiotto (Member Author)

Here is an example of the new logging messages:

2023/03/21 10:02:08 merge(type=memory:alloc_objects:count:space:bytes,reportType=REPORT_TYPE_PPROF,over=15m0s): took 782.871125ms
2023/03/21 10:02:09 merge(type=memory:alloc_objects:count:space:bytes,reportType=REPORT_TYPE_FLAMEGRAPH_UNSPECIFIED,over=15m0s): took 908.05025ms
2023/03/21 10:02:10 merge(type=memory:alloc_objects:count:space:bytes,reportType=REPORT_TYPE_FLAMEGRAPH_TABLE,over=15m0s): took 825.031083ms
2023/03/21 10:02:11 labels: took 1.130625ms and got 2 results
2023/03/21 10:02:11 values(label=job): took 940.083µs and got 2 results
2023/03/21 10:02:12 profile types: took 274.913083ms and got 7 types
2023/03/21 10:02:12 range(type=memory:alloc_objects:count:space:bytes,over=15m0s): took 441.7145ms and got 2 series
2023/03/21 10:02:13 range(type=memory:alloc_objects:count:space:bytes,over=168h0m0s): took 444.302ms and got 2 series
2023/03/21 10:02:13 single(series=memory:alloc_objects:count:space:bytes{instance="demo.parca.dev:443", job="demo"}, reportType=REPORT_TYPE_PPROF): took 391.051334ms
2023/03/21 10:02:13 single(series=memory:alloc_objects:count:space:bytes{instance="demo.parca.dev:443", job="demo"}, reportType=REPORT_TYPE_FLAMEGRAPH_UNSPECIFIED): took 489.131041ms
^C2023/03/21 10:02:13 querier: stopping
2023/03/21 10:02:13 single(series=memory:alloc_objects:count:space:bytes{instance="demo.parca.dev:443", job="demo"}, reportType=REPORT_TYPE_FLAMEGRAPH_TABLE): failed to make request: canceled: Post "http://localhost:7070/parca.query.v1alpha1.QueryService/Query": context canceled
2023/03/21 10:02:13 merge(type=memory:alloc_objects:count:space:bytes,reportType=REPORT_TYPE_PPROF,over=15m0s): failed to make request: canceled: context canceled
2023/03/21 10:02:13 querier: stopped
2023/03/21 10:02:13 terminated: received signal interrupt

This is also an improvement: previously we'd sometimes get errors with no indication of which query they came from.
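
Under the hood this is just timing each request and logging the query's parameters next to the result or error; roughly (the helper name here is illustrative, not the exact code):

package main

import (
    "log"
    "time"
)

// timed runs a single query and logs its duration or error together with a
// name that encodes the query's parameters, so failures can be attributed to
// the exact query that produced them.
func timed(name string, fn func() error) {
    start := time.Now()
    if err := fn(); err != nil {
        log.Printf("%s: failed to make request: %v", name, err)
        return
    }
    log.Printf("%s: took %v", name, time.Since(start))
}

func main() {
    timed("range(type=memory:alloc_objects:count:space:bytes,over=15m0s)", func() error {
        time.Sleep(time.Millisecond) // stand-in for a real request
        return nil
    })
}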

@asubiotto force-pushed the alfonso-refactor branch 2 times, most recently from c6f40d8 to aade067 on March 21, 2023 at 11:21

@asubiotto (Member Author)

Friendly ping

@metalmatze (Member) left a comment

The old version of parca-load first tried to figure out what data was available and then queried randomized data from a Parca instance.

This version of parca-load implements a different approach, and it definitely suits us better for now. 👍

Left a bunch of comments in the review.
Great work!

querier.go (outdated)
log.Println("values: no labels to query")
return
}
label := q.labels[len(q.labels)/2]

Member:
This will always query values for the same label name if the labels are stable. We should add some randomness here so that every run queries slightly different data in the system.

Member Author:
Agreed, I'm going to add an RNG. The only problem is that we could end up reporting metrics for different queries. In this case it doesn't matter, but what about merging CPU vs. memory profiles? I'm not sure those should belong to the same time series. Maybe we should do a query per value and expose those as different labels in the time series?
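
Roughly what I have in mind for the selection (a sketch only; the helper here is mine and not in this change):

package main

import (
    "fmt"
    "math/rand"
)

// pickRandom replaces the fixed "middle element" selection so that every run
// queries slightly different data.
func pickRandom(items []string) (string, bool) {
    if len(items) == 0 {
        return "", false
    }
    return items[rand.Intn(len(items))], true
}

func main() {
    labels := []string{"instance", "job"}
    if label, ok := pickRandom(labels); ok {
        fmt.Println("querying values for label:", label)
    }
}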

Member:
Yes, fair point. I think just randomness has served us well up until now, but we can always change that.

querier.go (resolved thread)
querier.go (outdated)
return
}

profileType := q.profileTypes[len(q.profileTypes)/2]

Member:
Same as above. We should make this configurable or randomized.

}
}

func (q *Querier) querySingle(ctx context.Context) {

Member:
I don't think we need querySingle anymore. It is still available in Parca and Polar Signals Cloud; however, the UI no longer uses that API. We should just remove it from parca-load.

Member Author:
How does the UI display a single profile or get a profile to diff? I think this is still a very important query to have, as it highlighted deficiencies in object storage filtering recently and I think we want to add it to our OKRs next quarter.

Member:
Instead of using this old API, the UI now issues a merge request where the start and end timestamps are the same. That way we only need one API going forward, but we can still request individual profiles for a given timestamp. The old querySingle was mostly a shim for that.
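
For reference, a rough sketch of what such a request looks like; the generated Go types, field names, and import path for parca.query.v1alpha1 are written from memory here and should be treated as assumptions rather than an exact reproduction:

package main

import (
    "time"

    // Import path as I recall it for the generated parca.query.v1alpha1 types.
    querypb "github.com/parca-dev/parca/gen/proto/go/parca/query/v1alpha1"
    "google.golang.org/protobuf/types/known/timestamppb"
)

// mergeAtInstant builds a merge request whose start and end are the same
// timestamp, which is how the UI now fetches an individual profile through
// the merge code path instead of the old single-profile API.
func mergeAtInstant(query string, at time.Time) *querypb.QueryRequest {
    ts := timestamppb.New(at)
    return &querypb.QueryRequest{
        Mode: querypb.QueryRequest_MODE_MERGE,
        Options: &querypb.QueryRequest_Merge{
            Merge: &querypb.MergeProfile{
                Query: query,
                Start: ts, // start == end: a "merge" over a single instant
                End:   ts,
            },
        },
        ReportType: querypb.QueryRequest_REPORT_TYPE_FLAMEGRAPH_TABLE,
    }
}

func main() {
    req := mergeAtInstant(`memory:alloc_objects:count:space:bytes{job="demo"}`, time.Now())
    _ = req // in parca-load this would be sent via the QueryService Query RPC
}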

Member Author:
👍🏼 I'll open a separate issue for this too

querier.go (outdated)
return
}

profileType := q.profileTypes[len(q.profileTypes)/2]

Member:
Here, the profileType should be the same profileType used in the previous QueryRange. Then we can rely on the start and end timestamps to figure out which profiles to query for the merge.

Member Author:
This is something that I would like to discuss with you. I don't think we need to rely on the timestamps returned by QueryRange.
We issue, for example, 15-minute query ranges on data that does not span the whole 15 minutes (say we've only collected 5 minutes of data). That query is considered successful and its latency is reported through the metrics. Why not treat merges the same way? It reduces complexity. And yes, the latency might be a lot better when we have less data, but I think that's fine; it should be easy to attribute to data size.
Additionally, I think it's better to issue a merge query that might not cover the full time range than to silently not run merge queries at all. If you look at Pyrra for our cloud project, for example, there is no consistency in how often merge queries are run.

querier.go (outdated, resolved)
sync/expire.go (outdated, resolved)

@asubiotto (Member Author)

Updated with the discussed changes

@asubiotto merged commit 045bca0 into main on Mar 23, 2023
@asubiotto deleted the alfonso-refactor branch on March 23, 2023 at 18:12