Data transforms: where should they be done?

We’re facing a significant challenge with a dataset of 135 million objects, requiring numerous transformations like filtering and adding new columns. We’re contemplating the best approach:

  1. Performing the transformations in the data lineage with PySpark, although this may mean losing some of the pure (raw) data along the way (see the sketch after this list).
  2. Using Quiver, although materialization doesn’t allow for column name changes.
  3. Leveraging TypeScript functions and integrating them into Workshop.

Any insights or recommendations on the most effective method would be greatly appreciated.
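
For reference, a minimal sketch of what option 1 could look like as a Foundry Python transform — the dataset paths, column names, and filter condition below are placeholders, not our actual schema:

```python
from pyspark.sql import functions as F
from transforms.api import transform_df, Input, Output


@transform_df(
    Output("/Project/datasets/objects_transformed"),  # placeholder output path
    source=Input("/Project/datasets/objects_raw"),    # placeholder input path
)
def compute(source):
    # Filter rows and derive a new column. The raw input dataset is never
    # modified; only the downstream output dataset is materialized.
    return (
        source
        .filter(F.col("status") == "ACTIVE")  # placeholder filter condition
        .withColumn("amount_usd", F.col("amount") * F.col("fx_rate"))  # placeholder derived column
    )
```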

You’ll need to provide more details for us to offer a helpful response:

  • What is the scenario in which you would lose data with Python Transforms? This isn’t self-evident.
  • In what context do you need to change column names? Is this for display purposes only? Or are you seeking to change column names in either the backing dataset or the object type?
  • What is the workflow you’re trying to enable here?

Taylor, I’ll keep it simple. In financial reporting we make heavy use of pivot tables, and these do not show the full report because of ‘computational limitations’. Can this limitation be solved by turning on more CPUs, or do I need tighter filtering?

Well, Contour has pivot tables and those scale quite well, so perhaps those would work. But again, it’s hard to be helpful without more details.

There’s no “turn on more CPUs” for Quiver, Functions, etc. That does apply to Code Workspaces or Lightweight Transforms, but it sounds like you’re not using either of those today.
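
For what it’s worth, in a Lightweight Transform the resource request lives in code. A rough sketch, assuming the `lightweight` decorator from `transforms.api` with `cpu_cores`/`memory_gb` parameters — treat the parameter names, paths, and columns here as assumptions to verify against the docs:

```python
import pandas as pd
from transforms.api import Input, Output, lightweight, transform


@lightweight(cpu_cores=8, memory_gb=32)  # assumed parameter names; values are illustrative
@transform(
    out=Output("/Project/datasets/report_aggregates"),  # placeholder path
    source=Input("/Project/datasets/objects_raw"),       # placeholder path
)
def compute(out, source):
    # Runs in a single container sized by the decorator above, rather than a
    # Spark cluster, so "more CPUs" is an explicit, per-transform setting.
    df: pd.DataFrame = source.pandas()
    summary = df.groupby("category", as_index=False)["amount"].sum()  # placeholder columns
    out.write_pandas(summary)
```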