Developper documentation relation postgres <--> Ceph #2658

Open
VannTen opened this issue Jul 11, 2022 · 15 comments
Labels
kind/documentation Categorizes issue or PR as related to documentation. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. sig/stack-guidance Categorizes an issue or PR as relevant to SIG Stack Guidance.

Comments

@VannTen
Member

VannTen commented Jul 11, 2022

It's not very clear from the code or the documentation what Ceph is used for:

I can see here that we store task results, and
here
how to access it, but there should be a high-level description of its role,
comparable to the postgres schema that we can generate with generate-schema.

Currently it's a bit hard to get a definite idea of what is inside (and
consequently, what should be deleted, see #2657)

Also, is it only used as an S3 store? In that case, why reference Ceph
in particular?

/kind documentation

@VannTen VannTen added the kind/bug Categorizes issue or PR as related to a bug. label Jul 11, 2022
@VannTen VannTen changed the title Developper documentation relation postres <--> Ceph Developper documentation relation postgres <--> Ceph Jul 11, 2022
@VannTen
Member Author

VannTen commented Jul 13, 2022

@thoth-station/devs I would appreciate it if anyone has some insight or pointers regarding this.

I see a number of purge_* methods in thoth/storages/graph/postgres.py with a ceph.delete call inside. Does that mean that deleting data from Ceph is already taken care of, whatever state postgres is in?

@VannTen
Member Author

VannTen commented Jul 13, 2022

/cc @harshad16

@harshad16
Member

harshad16 commented Jul 15, 2022

Let's gather what kind of details would help everyone.
To start, let's try to understand the Ceph and Postgres databases with the current problem statement in hand, i.e. deleting an existing package index:

Ceph is document storage. We keep a lot of documents in our Ceph services, produced by various components. For example, when we run a solver, we keep its results in Ceph and sync them into the Postgres db with the help of the storages module.
The document is kept in Ceph so that we can replicate the result to the Postgres db again if we ever need to.
Other examples are adviser runs and package-extract runs.

The list of these kinds of results can be found here: https://github.com/thoth-station/storages/blob/master/thoth/storages/__init__.py#L20-L42
The notable point is that the results stored in Ceph are mostly the output of a component run.
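
As a rough illustration of the flow described above (a component run produces a result document, the document is kept in Ceph, and it is then synced into Postgres), here is a minimal sketch. The names `DocumentStore`, `GraphDatabase`, `record_run` and `resync_all`, as well as the document fields, are hypothetical and backed by in-memory dicts; this is not the actual thoth-storages API.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentStore:
    """Hypothetical stand-in for the Ceph bucket: document id -> full result."""

    documents: dict = field(default_factory=dict)

    def store(self, document_id: str, result: dict) -> None:
        self.documents[document_id] = result


@dataclass
class GraphDatabase:
    """Hypothetical stand-in for Postgres: rows derived from synced documents."""

    rows: dict = field(default_factory=dict)

    def sync(self, document_id: str, result: dict) -> None:
        # Only a derived view is kept here, keyed by the document id.
        self.rows[document_id] = {"component": result["component"]}


def record_run(store: DocumentStore, db: GraphDatabase, document_id: str, result: dict) -> None:
    """Keep the original result in the store, then sync it into the database."""
    store.store(document_id, result)
    db.sync(document_id, result)


def resync_all(store: DocumentStore, db: GraphDatabase) -> None:
    """Rebuild database rows from the stored documents."""
    for document_id, result in store.documents.items():
        db.sync(document_id, result)
```

The point of the sketch is only that the database rows can be rebuilt from the stored documents, while the reverse is not possible.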

In the case of the Python package index, we won't have a specific document for each registered index, in Ceph at least.
We do have a table in the Postgres db.

Hopefully the comment above makes clear what kind of results go into Ceph.
Now let's try to understand what to remove from Ceph when purging something from the Postgres db.

As I mentioned, these docs are kept in Ceph so that they can be synced to the Postgres db if ever needed in the future.
So when we remove a whole group of something, we try to remove it from Ceph as well, so that it doesn't accidentally get synced again.
For example, if we delete the solver runs for a specific python-os version, we purge the docs related to them.
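
A minimal sketch of that purge behaviour, assuming both stores are plain dicts keyed by document id and that each Postgres row carries a hypothetical `solver_name` field. The function name `purge_solver_runs` is made up for illustration and is not one of the real purge_* methods.

```python
def purge_solver_runs(db_rows: dict, ceph_docs: dict, solver_name: str) -> list:
    """Delete all runs of one solver from both stores and return their ids."""
    purged = sorted(
        doc_id for doc_id, row in db_rows.items() if row["solver_name"] == solver_name
    )
    for doc_id in purged:
        del db_rows[doc_id]
        # Drop the source document too; otherwise a later sync could bring the row back.
        ceph_docs.pop(doc_id, None)
    return purged
```

Deleting only the database row would leave the document behind, and any sync that walks the stored documents could then recreate the row.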

As we won't have documents depending directly on the index,
we should check for an indirect relation:

  • one method could be to check through the Postgres relations
    or
  • as all our calls go through POST API calls, we can check for the dependency there.

We may find that the solver result doc keeps track of the index.
So we should work towards that: check whether the sync logic considers the index and, if it does, purge the solver docs along with the index deletion.
This could be a little tricky, so we may need further discussion.
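
If solver result docs do keep track of the index, the indirect check could look roughly like the sketch below. The `index_urls` field and the function name are assumptions made for illustration; the real solver document layout would need to be checked first.

```python
def solver_docs_for_index(solver_docs: dict, index_url: str) -> list:
    """Return the ids of solver documents that mention the given index URL."""
    return sorted(
        doc_id
        for doc_id, doc in solver_docs.items()
        # `index_urls` is a hypothetical field, not a confirmed part of the schema.
        if index_url in doc.get("index_urls", [])
    )
```

The returned ids would be the candidates to purge together with the index.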

Additional answers:

Also, is it only used as an S3 store ? In that case, why reference Ceph particularly ?

Ceph is an open source storage service that exposes an S3-compatible API, so our documentation sometimes refers to S3 calls or an S3 store. However, Ceph is the service that is actually deployed and used, and because its calls are also S3 calls, both names show up in various places.

@VannTen
Member Author

VannTen commented Jul 15, 2022 via email

@VannTen
Member Author

VannTen commented Jul 18, 2022

@mayaCostantini Any thoughts ?

@harshad16
Member

Let's see if I can summarize what I understand, to see if I really do understand. I'll see if I can work on a PR to add that in docs after that. We store documents in Ceph. Those are the results of various kinds of operations. Those documents are original data (they can't be reconstructed from the postgres db). Postgres references those documents.

Yes, it is original result data. It can't be reconstructed from the Postgres db, only by re-running the operations.

Follow-up questions:

  • I'm not sure I understand what's going on in the sync process. If something exists in Ceph, will it be referenced/created in postgres?

Not true for every sync, but some of them are designed that way. For example,
the graph-refresh component schedules solver runs for the packages which are missing from the Postgres db, so it would try to re-sync a document from Ceph if it is not in the Postgres db.
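
The "in Ceph but not in Postgres" check behind such a re-sync boils down to a set difference on document ids. A minimal sketch, with a hypothetical function name and plain lists of ids as input (not the actual graph-refresh code):

```python
def documents_to_resync(ceph_ids, postgres_ids) -> list:
    """Return ids of documents present in Ceph but not yet synced to Postgres."""
    return sorted(set(ceph_ids) - set(postgres_ids))
```

For instance, `documents_to_resync(["a", "b", "c"], ["b"])` yields the ids `a` and `c`.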

  • Can we map one document in Ceph to one entity only in postgres? In other words, do we have a many-to-one relation between Ceph documents and postgres entries (or another relation, or does that depend)?

The mapping is more of a many-to-many one.
Postgres has tables, and the Ceph store has multiple directories (these should have been different buckets; however, we use one bucket with a different directory in it for each kind of operation).
For example, an adviser result and a solver result are two different documents, and they are used in different places in the Postgres db tables.
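
The "one bucket, one directory per kind of operation" layout can be pictured as key prefixes. The sketch below is an assumption-laden illustration: the directory names, the key format and both function names are hypothetical, and the bucket is modelled as a plain dict of key to content.

```python
RESULT_DIRECTORIES = ("adviser", "solver")  # hypothetical directory names


def object_key(result_type: str, document_id: str) -> str:
    """Build the key of a document inside the single shared bucket."""
    if result_type not in RESULT_DIRECTORIES:
        raise ValueError(f"unknown result type: {result_type!r}")
    return f"{result_type}/{document_id}"


def list_documents(bucket: dict, result_type: str) -> list:
    """List the document ids stored under one result-type directory."""
    prefix = f"{result_type}/"
    return sorted(key[len(prefix):] for key in bucket if key.startswith(prefix))
```

With such a layout, Postgres only needs to keep the document id, and the directory follows from the kind of result being looked up.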

  • Do the documents in Ceph reference back to postgres entries?
  • Which one of them is the single source of truth, postgres or Ceph? Or do they each participate in it?

Ceph doesn't reference back to Postgres; it's the other way around:
Postgres references Ceph data via document ids.

The two play different roles, so saying that one of them is the single source of truth would not be quite right.
For the devs of the data service, the Postgres db is the source of truth, as the application is designed around the tables.
For the devs of the data aggregation, the Ceph store is the source of truth.
So we have to keep both in sync for good results.

Yeah, I see what Ceph is. My question is more: do we use it exclusively through the S3 API, or do we also use other features, like CephFS or block storage? In the first case, we might drop references to Ceph in the docs and in the code and simply work with an S3 API, which could be backed by any service providing that S3 API (the fact that it's backed by Ceph would be an operational detail).

We use it through the S3 API, or through packages that support S3; I don't know what these packages use underneath in their architecture.
Please feel free to update the docs; we can discuss the specifics in the review.
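
To make the "any S3-compatible backend" idea discussed above concrete, here is a minimal sketch of what depending only on a small object-store interface could look like. `ObjectStore`, `InMemoryStore`, `save_result` and `load_result` are hypothetical names for illustration; the existing code does not necessarily look like this.

```python
import json
from typing import Protocol


class ObjectStore(Protocol):
    """The only operations the calling code is assumed to need."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...

    def delete(self, key: str) -> None: ...


class InMemoryStore:
    """Test double; a Ceph- or S3-backed class would expose the same methods."""

    def __init__(self) -> None:
        self._objects = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def delete(self, key: str) -> None:
        self._objects.pop(key, None)


def save_result(store: ObjectStore, key: str, result: dict) -> None:
    """Serialize a result document as JSON and store it under the given key."""
    store.put(key, json.dumps(result).encode("utf-8"))


def load_result(store: ObjectStore, key: str) -> dict:
    """Load and deserialize a result document."""
    return json.loads(store.get(key).decode("utf-8"))
```

Code written against `ObjectStore` would not need to mention Ceph at all; which service sits behind it becomes a deployment detail.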

@VannTen
Member Author

VannTen commented Jul 18, 2022 via email

@mayaCostantini
Contributor

@VannTen I think Harshad provided a great explanation, I don't see any more details to add that could be useful. Thanks @harshad16 !

@VannTen
Member Author

VannTen commented Jul 19, 2022

related: thoth-station/thoth-application#2539

@goern
Member

goern commented Jul 19, 2022

Is this something we can extract and summarize into the docs?

@goern goern added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jul 19, 2022
@VannTen
Member Author

VannTen commented Jul 20, 2022 via email

@VannTen
Member Author

VannTen commented Sep 5, 2022

/remove-kind bug
/kind documentation
/priority important-longtem

Related (closely) : #2691

@sesheta sesheta added kind/documentation Categorizes issue or PR as related to documentation. and removed kind/bug Categorizes issue or PR as related to a bug. labels Sep 5, 2022
@sesheta
Member

sesheta commented Sep 5, 2022

@VannTen: The label(s) priority/important-longtem cannot be applied, because the repository doesn't have them.

In response to this:

/remove-kind bug
/kind documentation
/priority important-longtem

Related (closely) : #2691

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@VannTen
Member Author

VannTen commented Sep 5, 2022

/priority important-longterm

@sesheta sesheta added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Sep 5, 2022
@harshad16 harshad16 moved this to 🆕 New in Planning Board Oct 7, 2022
@VannTen
Member Author

VannTen commented Jan 30, 2023

/sig stack-guidance
/remove-priority important-soon
also, see #2767

@sesheta sesheta added sig/stack-guidance Categorizes an issue or PR as relevant to SIG Stack Guidance. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Jan 30, 2023
Projects
Status: 🆕 New

5 participants