Partitioned writes with multiple columns creates wrong directory structure if child output columns is not in same order #8663

Open
ayushi-agarwal opened this issue Feb 5, 2025 · 1 comment
Labels: bug, triage

ayushi-agarwal commented Feb 5, 2025

Backend

VL (Velox)

Bug description

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Assumes an active SparkSession named `spark` (e.g. in spark-shell)
val data = Seq(
  ("b1", 1, "a1"),
  ("b2", 2, "a2"),
  ("b1", 3, "a1"),
  ("b2", 4, "a2")
)

// Define schema explicitly; note the column order is b, c, a
val schema = StructType(Seq(
  StructField("b", StringType, nullable = false),
  StructField("c", IntegerType, nullable = false),
  StructField("a", StringType, nullable = false)
))

val rdd = spark.sparkContext.parallelize(data).map {
  case (b, c, a) => Row(b, c, a)
}

val df = spark.createDataFrame(rdd, schema)

// Write DataFrame as Parquet with partitioning
df.write
  .format("parquet")
  .partitionBy("a", "b")  // Partition by columns a and b, in that order
  .mode("overwrite")      // Overwrite if output exists
  .save("file:///tmp/partitioned_output")

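This creates the following directory structure, where the partition directories follow the schema order (b before a):
b=b1/a=a1
b=b2/a=a2

instead of the expected structure, which follows the partitionBy order (a before b):
a=a1/b=b1
a=a2/b=b2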
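
To make the difference between the two layouts concrete, here is a minimal pure-Scala sketch. It is not Gluten or Spark code: `PartitionPathSketch` and `partitionPath` are hypothetical names used only for illustration. It assumes that the reported layout comes from ordering partition columns by child output (schema) order rather than by `partitionBy` order, as the issue title suggests.

```scala
// Hypothetical illustration only; not the Gluten or Spark implementation.
object PartitionPathSketch {
  /** Joins Hive-style `col=value` segments in the order given by `columnOrder`. */
  def partitionPath(columnOrder: Seq[String], row: Map[String, String]): String =
    columnOrder.map(c => s"$c=${row(c)}").mkString("/")

  def main(args: Array[String]): Unit = {
    val childOutput = Seq("b", "c", "a") // schema order from the report
    val partitionBy = Seq("a", "b")      // order passed to partitionBy
    val row = Map("b" -> "b1", "c" -> "1", "a" -> "a1")

    // Expected layout: segments follow the partitionBy order.
    println(partitionPath(partitionBy, row))

    // Assumed cause of the reported layout: partition columns are
    // picked up in child output order instead.
    println(partitionPath(childOutput.filter(partitionBy.contains), row))
  }
}
```

Running `main` prints `a=a1/b=b1` for the expected ordering and `b=b1/a=a1` for the child-output ordering, which matches the two layouts listed above.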

@zhouyuan @JkSelf @rui-mo

Spark version

None

Spark configurations

spark - 3.5.1

System information

No response

Relevant logs

ayushi-agarwal added the bug and triage labels Feb 5, 2025
ayushi-agarwal changed the title from "Partitioned writes with multiple columns creates wrong partitions if child output columns is not in same order" to "Partitioned writes with multiple columns creates wrong directory structure if child output columns is not in same order" Feb 5, 2025

JkSelf commented Feb 7, 2025

@ayushi-agarwal @FelixYBW I will look into this issue after I return to work on February 10th.
