如何使用由2列合并而成的字段执行查询?


我正在使用Java Spark库构建一系列的分布分析。这是我用来从JSON文件获取数据并保存输出的实际代码。

Dataset<Row> dataset = spark.read().json("local/foods.json");
        dataset.createOrReplaceTempView("cs_food");

List<GenericAnalyticsEntry> menu_distribution= spark
        .sql(" ****REQUESTED QUERY ****")
        .toJavaRDD()
        .map(row -> Triple.of( row.getString(0), BigDecimal.valueOf(row.getLong(1)), BigDecimal.valueOf(row.getLong(2))))
        .map(GenericAnalyticsEntry::of)
        .collect();

writeObjectAsJsonToHDFS(fs, "/local/output/menu_distribution_new.json", menu_distribution);

我要查找的查询基于以下结构:

+------------+-------------+------------+------------+
| FIRST_FOOD | SECOND_FOOD | DATE       | IS_SPECIAL |
+------------+-------------+------------+------------+
| Pizza      | Spaghetti   | 11/02/2017 | TRUE       |
+------------+-------------+------------+------------+
| Lasagna    | Pizza       | 12/02/2017 | TRUE       |
+------------+-------------+------------+------------+
| Spaghetti  | Spaghetti   | 13/02/2017 | FALSE      |
+------------+-------------+------------+------------+
| Pizza      | Spaghetti   | 14/02/2017 | TRUE       |
+------------+-------------+------------+------------+
| Spaghetti  | Lasagna     | 15/02/2017 | FALSE      |
+------------+-------------+------------+------------+
| Pork       | Mozzarella  | 16/02/2017 | FALSE      |
+------------+-------------+------------+------------+
| Lasagna    | Mozzarella  | 17/02/2017 | FALSE      |
+------------+-------------+------------+------------+

我如何从上面写的代码中实现这个(写在下面)输出?

+------------+--------------------+----------------------+
| FOODS      | occurrences(First) | occurrences (Second) |
+------------+--------------------+----------------------+
| Pizza      | 2                  | 1                    |
+------------+--------------------+----------------------+
| Lasagna    | 2                  | 1                    |
+------------+--------------------+----------------------+
| Spaghetti  | 2                  | 3                    |
+------------+--------------------+----------------------+
| Mozzarella | 0                  | 2                    |
+------------+--------------------+----------------------+
| Pork       | 1                  | 0                    |
+------------+--------------------+----------------------+

我当然试着自己找出一个解决方案,但在我的尝试中没有运气,我可能是错的,但我需要这样的东西:

"SELECT (first_food + second_food) as menu, COUNT(first_food), COUNT(second_food) from cs_food GROUP BY menu"

转载请注明出处:http://www.cxyjjj.com/article/20230526/1613676.html